Hi, everyone, and welcome to KubeCon. Today we're going to talk about static analysis of Kubernetes manifests, Kubernetes YAMLs. We'll talk about how to identify misconfigurations and the process of doing that within a typical organization. We'll look at some security findings we had while scanning open source repositories containing Kubernetes manifests, write policy as code, and automate that as part of our CI/CD pipeline and also on a running cluster. Thanks very much to KubeCon for having me. My name is Barak Schoster. I'm the CTO and co-founder of a startup named Bridgecrew that helps solve exactly that: identifying misconfigurations in infrastructure as code manifests. I'm also an open source maintainer of a cloud security project named Checkov, which we're going to talk about today, and a contributor to other projects in the cloud security space. All of the slides are going to be available on the Bridgecrew blog, and if you have any questions, or a joint open source project you'd like to do, feel free to reach out on Twitter or follow my GitHub account. Before diving into Checkov and what it does, let's give some background with the story of an engineer, in this case myself, having to maintain a Kubernetes cluster. As an engineer, I want to move fast, and unlike what people like to think, I really do not like to break things. That means I want my cluster to run in a reliable and secure manner. It also means I probably have a love-hate relationship with ticketing systems. My day-to-day activities as an engineer include sprint planning, where I have a predictable amount of tasks that I want to handle during the next two weeks. But I also have unpredictable tasks: security issues that the security team finds, whether in my production cluster or my dev environment. Here's an example of one.
In Kubernetes, container user IDs should really not be shared with the host they are running on, because a root process might leak from the container level into the host level, giving it permission to compromise the entire host and the other workloads running within it. Another one is that I really should not allow the container and the host to share a network namespace, which hurts the isolation between the container and the host. And another one in the networking space that I just got is not related to the Kubernetes cluster specifically, but to the infrastructure wrapping it. Over here I have a security group, a firewall rule, that is allowing SSH access from the entire world into my cluster, because the security group wrapping the cluster allows that. Basically, what I'm saying is that I have this sprint planning, where I should have a predictable amount of work that I gave story points to and distributed within my team, but the security team, which is doing an honest job finding misconfigurations in my Kubernetes manifests and my Terraform infrastructure code, is creating a lot of Jiras and hurting my original planning. And this is where my story begins. I just had too many Jiras and too few resources. I had to reprioritize my sprint over and over again, mid-sprint, and it distracted me from the planned features that, as an engineering business, we want to deliver to our customers. So I tried to understand if others are having the same issue. Do open source projects have the same issues when they're creating Kubernetes clusters? Are they having the same issues when they're creating infrastructure as code modules? So I decided to scan GitHub repos at large. Let's look at the data. Over here, we have a sample of that scan that I performed on open source repositories.
In this scan, I looked for all of the Kubernetes deployment manifests that I could find crawling GitHub, or at least a sample of them. And I identified several disturbing default configurations that really should not be there. The reason I'm saying they should not be there is that, as an engineer, the thing I do the most is hit copy and paste from Stack Overflow or open source repositories and get inspired by them. Kubernetes and other infrastructure as code frameworks have so many configurations that it is really easy to get lost and not identify that some of those configurations might be a risk, while others are just an opinionated way to run a Kubernetes cluster, and there is nothing wrong with that. Some of the configurations might be the right ones for the cluster. But I really wanted the engineering team to acknowledge each and every best practice and verify whether it's applicable to our deployment. So over here we have some of the top misconfigurations that can be found in public open source repositories. Some of them are disturbing. One is having a public endpoint accessible from the entire internet, not disabled. Others are more about auditing capabilities, like having logging enabled by default, or having secrets encrypted by default. And some others are just best practices to have if you're running a Kubernetes cluster. So the thought that came to my mind is that infrastructure as code manifests present a risk, because they're another configuration layer that people might make mistakes in. And like the Jira tickets we just saw, my engineering team and I have made those mistakes. But they also present a new opportunity. Let's take a look at this Kubernetes manifest as an example. We have runAsNonRoot configured to false, and over here, we have runAsNonRoot configured to true.
But the thing about the one on the left is that at the container level we are not running as non-root; we're overriding the pod configuration that told us to run as non-root. So on the container level, we're enabling the root user to run, overriding the pod specification. This is a bad configuration to have, while this one looks like a fixed Kubernetes manifest, not allowing root to run. Looking at those Kubernetes manifests, and at the different configuration frameworks, the same goes for Kubernetes, Terraform, or anything else I can configure in code, allowed us to create an open source project named Checkov. Checkov is a static analysis open source tool that lets you scan infrastructure as code manifests. It was released last December. It's under the Apache 2 license and already has more than 50 contributors. So thank you, contributors, for making that project possible. It has more than 8,000 downloads, more than 1,400 stars, and it's written in Python, so it's very easily extended. Checkov statically analyzes for known best practices across infrastructure as code manifests like the Kubernetes YAMLs we're focused on today, but it also does a good job with Terraform HCL files, CloudFormation YAMLs, Azure ARM templates, and the Serverless Framework, and it can easily be extended to other infrastructure as code frameworks. Over here we have a sample output of Checkov scanning an infrastructure as code directory, and we'll run that as a demo. Checkov is essentially a policy as code engine. What is policy as code? It is the ability to enforce best practices in a manner where a policy can be version controlled, meaning you can go back and forth in time in terms of commits, you can create a version of your policies and create a policy bundle, and a policy can be peer reviewed as part of a pull request. Specifically in Checkov, we're utilizing Python's inheritance capability to share logic between different policies.
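To illustrate the override described above, here is a minimal manifest sketch (field names follow the Kubernetes securityContext API; the pod name and image are placeholders):

```yaml
# Bad: the container-level securityContext overrides the pod-level one,
# so this container may run as root despite the pod-level setting.
apiVersion: v1
kind: Pod
metadata:
  name: example            # placeholder name
spec:
  securityContext:
    runAsNonRoot: true     # pod level: do not run as root
  containers:
    - name: app
      image: nginx         # placeholder image
      securityContext:
        runAsNonRoot: false  # container level: overrides the pod setting
```

The fixed version either omits the container-level field, inheriting the pod setting, or sets `runAsNonRoot: true` explicitly on the container as well.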
And a policy can also be part of your development lifecycle. Say you're developing new infrastructure and you want to enforce a company policy, for example, that each cluster should have an owner team tag. You can automate that as part of your planning: when we're planning a new cluster or new infrastructure in the cloud, we're also planning which policies we would like to enforce on that cluster, as part of the threat modeling that can be done on that infrastructure. And once you have the policy written, you can automate it as part of your CI pipeline, scanning each and every infrastructure change with a tool like Checkov. So what is a policy? Here's an example of one. A best practice in a Kubernetes cluster is that CPU limits should be set. I created a name for that policy, created an identifier for it, set what kind of configurations I want to scan, and basically looked for the resources section in the Kubernetes manifest, its limits, and verified that cpu is set. If it is set, the check passes; if not, the check fails. All right, so let's take a look at a live demo. Over here, what I have is a Kubernetes project with a Kubernetes deployment YAML of a Jenkins server. I can see here the specs and the tags, basically everything that I need to start a Jenkins cluster. What I'm going to do now is, first, pip install checkov. Checkov's prerequisite is that you use Python 3, so if Python 3 is not the default Python version on your workstation, use pip3 install checkov. I already have it installed over here. So what I'm going to run now, let's take a look at how it looks, is Checkov scanning my current directory containing the Kubernetes manifests. Checkov can scan the different frameworks we've talked about: CloudFormation, Terraform, Kubernetes, Serverless, ARM.
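The logic of that CPU-limits policy, written here as plain Python over an already-parsed manifest rather than against Checkov's actual plugin API (the function name and result strings are illustrative, not Checkov's), might look like this:

```python
# Sketch of a CPU-limits check over a parsed Kubernetes Deployment manifest
# (a dict, as produced by yaml.safe_load). Names are illustrative, not Checkov's API.

def check_cpu_limits(deployment: dict) -> str:
    """Return "PASSED" if every container sets a CPU limit, else "FAILED"."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for container in containers:
        limits = container.get("resources", {}).get("limits", {})
        if "cpu" not in limits:
            return "FAILED"
    return "PASSED"


good = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "resources": {"limits": {"cpu": "500m"}}}
    ]}}},
}
bad = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [{"name": "app"}]}}},
}

print(check_cpu_limits(good))  # PASSED
print(check_cpu_limits(bad))   # FAILED
```

The same pass/fail shape is what a real Checkov check produces per resource, which is what lets the CLI report each failing section with its file location.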
And we want it to scan all of those options, even though currently the directory contains only Kubernetes. I can also choose to add external checks to the run, meaning custom checks that are not part of the 400 policies that Checkov ships with. Before diving further into the demo, the thing I would like to show you is the number of different scans we currently have. If you go to checkov.io and open the resource scans documentation, you can see the entire list of checks across the different frameworks it can scan. So over here I have a set of Kubernetes checks; they scan different parts of the configuration itself. If I'd like to create a custom check, just like the one we've seen before, it's just a matter of creating a Python file and putting it in a directory. I can also put it in a GitHub repository; that way the custom policy, the custom checks that I've created, can be peer reviewed and version controlled. For that reason, we have the external checks Git option. I can also ask Checkov to run only a specific check ID, or to hide passing checks and show only the failing ones, and I can also integrate it with the Bridgecrew platform. Let's give it a try. So what I'm going to run now is checkov -d, meaning directory, and I want it to scan my current directory with my Jenkins Kubernetes configuration. I have it executed right here. It first tells me if I need to upgrade Checkov to a newer version; over here I have a new update available, which I can install using the following command. Let's skip that for now. And I can see the entire results of the Kubernetes manifest scan. I have 31 passing checks, which is good, meaning I did some good work from a security perspective on my Kubernetes manifest. And I have 19 failing checks. So if I drill down into the failing ones, over here I have, for example, a seccomp profile that is not set.
So what I can do is open this URL, which opens up a guidelines page with a description and rationale on why I need to do that and what a fixed manifest should look like. The thing I should probably do is copy the following section and put it inside my Kubernetes manifest, and then my test, my Checkov check, will pass. We have here a set of different results; each time it will print the specific section inside the file that is not configured correctly and guide me with a link to the relevant documentation on how to remediate that piece. Now what we're going to do is talk about another way I can run this. Obviously, I can run Checkov over and over again manually on my workstation, and it will help me identify and fix misconfigurations in my Kubernetes manifests before pushing them to GitHub, before opening a pull request. But how can I automate that process? The thing I can add is a pre-commit hook: when I try to commit a change, the pre-commit hook will scan it and will either block or accept the commit locally on my workstation, and if it passes, the change goes through to GitHub, I open a PR, and I start the deployment into my Kubernetes cluster. Let's see what that looks like. So I'm in the same directory, and what I added right now is a pre-commit configuration. A pre-commit hook, for those who are not familiar with pre-commit, is a tool that hooks into the git commit command and scans all of the resources that are going to be committed into the repository, usually with linters. Over here I've configured Checkov as one of the hooks, and I told Checkov to always run on the current directory, the directory that contains that code. So if I type git status, I can see that I have this new file that I've created. What I'm going to do now is create a commit message, on my way to pushing this file into the GitHub repository.
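A pre-commit configuration along those lines might look roughly like this (a sketch: the `rev` tag is a placeholder, and the hook id assumes Checkov's published pre-commit hook):

```yaml
# .pre-commit-config.yaml - run Checkov on every `git commit`
repos:
  - repo: https://github.com/bridgecrewio/checkov
    rev: "1.0.0"           # placeholder; pin to a real release tag
    hooks:
      - id: checkov
        args: ["-d", "."]  # always scan the current directory
```

After `pre-commit install`, the hook runs automatically on each commit and blocks it when checks fail.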
I'm going to type git commit -m "my new Jenkins cluster". Let's give it a try. What it will do now is run Checkov, and it will fail the commit. I cannot commit, and it shows the results of Checkov on the current directory. Now I cannot make a commit without passing all of those checks. But what if one of those issues Checkov is reporting on is actually my opinionated way of how my Kubernetes cluster should run? Either there is no risk, or it is an accepted risk, or there is a justified reason to run Kubernetes in that manner. Checkov, just like JUnit, pytest, or other testing frameworks, supports skipping checks, skipping tests. So let's take a look at how that looks. I'm here again in the Checkov documentation. The thing you can do to skip a check is add a metadata annotation where you skip a specific check ID and write the reason for skipping that specific check. If you do that, it will be reported as a skipped check as part of the JUnit XML, if you're working with a JUnit XML plugin, or as a skipped check in the Bridgecrew or Checkov CLI report. If you do not want to go configuration by configuration and skip every individual check, you can exclude a specific check globally by using the skip-check flag with a specific check ID. All right, so we have a pre-commit hook, and now we cannot push a bad change into GitHub. But that actually means that every engineer needs the pre-commit hook deployed on their workstation. Another way to validate that infrastructure as code manifests meet those best practices is having Checkov as part of your CI/CD pipeline. Whenever a new change request is submitted to GitHub or another version control system, you can run Checkov as a CI job, just like you run unit tests. So let's call it infrastructure security tests.
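A resource-level skip annotation of the kind described above looks roughly like this (a sketch: the check ID, reason, and resource name are placeholders, and the annotation key format is as I recall it from Checkov's docs):

```yaml
# Sketch: skipping one Checkov check on a single Kubernetes resource
# via a metadata annotation (check ID and reason are placeholders).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jenkins
  annotations:
    checkov.io/skip1: CKV_K8S_8=Liveness probe is managed by the operator
```

The equivalent global exclusion would pass the same check ID on the command line, for example `checkov -d . --skip-check CKV_K8S_8`.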
And if the tests pass, you can trigger a deployment of your infrastructure into your cloud account. Now let's see what that looks like. What I have here is a fork of Kubernetes Goat. For those who are not familiar, it's an amazing project created by Madhu Akula. Thank you, Madhu, for creating that vulnerable-by-design Kubernetes cluster. In this project, you'll find educational material on what a bad cluster configuration looks like and how you can actually hack such a cluster. What I've added inside my fork is a GitHub Actions workflow containing a Checkov command, over here, scanning a specific directory. So what I'm going to do now is go into that specific directory and create a new file. The file I'm going to create is actually the same Jenkins one I used before, so I'm just going to copy it from my local environment. What I have here is the same Jenkins deployment; I'm going to give it the same name. And what I'm going to do now is create a new branch. I have this new file here, and I'm probably going to see a GitHub Action being triggered. What the GitHub Action will actually do, on every commit to a pull request that targets master, is take the Checkov action, the latest version, from GitHub. If you type in github.com/bridgecrewio/checkov-action, you'll find an example of how to run Checkov on your GitHub repository. So it installs the latest Checkov version, which takes about a minute, then checks out the repository with all of the Kubernetes code, and then reports a set of failing checks on my Kubernetes deployment. So if I go back to the pull request and refresh it, I can see that the Kubernetes Goat CI has failed on the Checkov action. And if I were not an admin, it would have blocked me from merging that specific branch into the main branch.
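A workflow of the kind described above might be sketched like this (assumptions: the workflow file name and scanned directory are placeholders, and in practice you would pin the action to a released version rather than a branch):

```yaml
# .github/workflows/infrastructure-security.yml - sketch of a Checkov CI job
name: infrastructure-security-tests
on:
  pull_request:
    branches: [master]
jobs:
  checkov:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run Checkov against the manifests
        uses: bridgecrewio/checkov-action@master
        with:
          directory: manifests/   # placeholder path to the Kubernetes YAMLs
```

With a branch protection rule requiring this job, a failing scan blocks the merge for non-admins, which is exactly the behavior shown in the demo.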
Meaning I cannot deploy a bad configuration into production if I'm working with a GitOps flow. Cool. So we have a CI/CD pipeline. And actually, another step we can take is to deploy Checkov as another container that pulls the running configuration out of the Kubernetes cluster, from the Kubernetes API. Then we have another phase of Checkov running against my EKS cluster, checking the runtime configuration, verifying that there are no drifts between my fixed configuration in GitHub and my Kubernetes cluster, and alerting me in Slack whenever there is a bad configuration in my production account. So misconfiguration analysis can actually be done in three places: pre-commit, post-commit on a pull request, and on a running cluster. That way we can be sure that misconfigurations do not reach my production environment. The pre-commit hook and the pull request checks really helped me develop at a faster pace, because I have a very fast feedback loop. I can automatically prevent those misconfigurations from happening in my environment, and I don't need that pile of Jira tickets interfering with my day-to-day business operations. So the key takeaways are: one, keep your manifests secure; have a fast feedback loop instead of a ton of Jira tickets; monitor both build-time resources, meaning the code itself, and the running cluster; and have the ability to create new policies, version control them, and review them. So that's all of it, folks. Thank you very much for joining this talk. If you want to try Checkov, just type in Checkov on GitHub. If you have any questions, please join our Slack channel at slack.bridgecrew.io or send me an email. Thank you very much, KubeCon, for having me, and have a great day, everybody.