Before we start, I'd like to thank all of the KubeCon organizers for giving us the opportunity to share our story with you all. Let's get started. Quentin and I work on the security team at Yelp. Today, we are going to talk about how we use Open Policy Agent and Active Directory to provide fine-grained role-based access control for the Kubernetes infrastructure at Yelp. Let's start with the motivation for this work. If you're not familiar with Yelp, it's a company that connects people with great local businesses. As of the end of 2020, Yelpers have contributed 224 million reviews on our platform. Approximately 31 million unique devices access Yelp every month on average. And more than half a million business locations spend money on Yelp ads every month to promote their businesses. As you can imagine, it takes a lot of infrastructure to support an application of that scale. Today, we have more than 1,000 geographically distributed software engineers across hundreds of different teams. We have a containerized microservice architecture managed by an in-house-built, open-source platform-as-a-service framework called PaaSTA. Under the hood, PaaSTA uses Kubernetes as the container orchestration framework to manage thousands of workloads. We have a dozen Kubernetes clusters that we run on EC2 instances, with several custom namespaces, where we run our microservices and batch jobs, as well as other types of stateful workloads like Cassandra clusters, Kafka, Spark, and many other use cases. But we weren't always using Kubernetes. In fact, it's a relatively new technology for us at Yelp. As early adopters of containerized workloads, we had been using Mesos as our container orchestration framework since 2014. Our Mesos infrastructure was primarily used for running services and batch workloads in containers.
And we didn't really have a strong need for fine-grained access control on the Mesos clusters, because only the infrastructure team needed to interact with Mesos directly, and they simply used shared secrets with administrative privileges for that access. Workload developers used an abstraction layer provided by the infrastructure team and didn't need direct access; to be honest, they didn't even need to think about the underlying technology at all. But over time, Kubernetes gained a lot of popularity in the community, and we eventually decided as a company to migrate over to Kubernetes for its pluggable components, extensibility, and wide community support. The migration to Kubernetes unlocked a lot of new use cases for us beyond just services and batch workloads. And with this came a lot of interest from other development teams within Yelp that wanted to leverage the Kubernetes infrastructure to support other types of workloads, such as Kafka, Cassandra, and various other use cases. These developers would need to interact with different namespaces and resources depending on their use case. Unfortunately, the security model that we used in the Mesos world was carried over to the Kubernetes world, which, in hindsight, doesn't make a lot of sense given all the new use cases and participants in the new ecosystem. As you can imagine, this created a major problem, because it meant that all Kubernetes users got administrative access to all Kubernetes clusters via a shared client certificate for the cluster admin role. It didn't matter how much permission they actually needed: they had admin privileges on all clusters. Even worse, to access the client certificate, these users had to be given sudo access to the sensitive control plane nodes. As you can imagine, this has obvious disadvantages.
Since everyone was using a shared set of credentials, it was difficult to understand who did what in a cluster, since every action appeared to come from the same user. Relatedly, anyone could easily make a mistake and accidentally modify or delete resources in any namespace. Additionally, this complicated our compliance narratives, and we had to do strange things to overcompensate, like creating completely separate bespoke Kubernetes clusters for sensitive workloads, which obviously doesn't scale. So with that picture, it was very clear to us that we needed to introduce fine-grained role-based access control. Let's go over our requirements and what we really needed to solve this access control problem in our large Kubernetes infrastructure. First, we wanted to authenticate individual users using an identity provider. Then we wanted to define authorization rules for Kubernetes objects based on team ownership, resource sensitivity, action sensitivity, and infinitely many other combinations of the custom taxonomies that exist in our infrastructure. Finally, we needed to maintain a formal paper trail for all changes to the authorization policies and all group memberships. For this project, we prioritized human users, since there are a lot of them with lots of use cases. Thankfully, we only had a handful of service users, and they each had their own client certificates along with their own role bindings and RBAC configurations. Starting with the human users therefore made a lot of sense for us. So now we've covered the motivation for this project, and at this point I will hand the presentation over to Quentin to deep dive into the technical details. Thank you, Charlie. At a high level, we have human users authenticate using their Okta credentials. Kubernetes supports this basically out of the box. A human user runs a command like kubectl get pods.
We provide a wrapper script that first prompts the user for their Okta authentication, which consists of their Active Directory credentials and a second factor like a YubiKey or an authenticator app. Upon success, they receive a JWT from Okta valid for one hour, which is then sent to the cluster via kubectl's --token parameter. The Kubernetes API server verifies the authenticity of the JWT by checking its signature against Okta's published public keys. The human user's identity from the token is then propagated downstream for authorization decisions. The benefits of using Okta for authentication are pretty straightforward. Root access is no longer needed to interact with the Kubernetes clusters. Each action can be tied back to a specific user rather than a generic administrative user, and credentials obtained from Okta are temporary and behind a second factor, which mitigates the risk of credential theft and replay attacks. Now let's talk about authorization. Here's an overview of the architecture. On the top, you've got the user, who interacts with Kubernetes using kubectl. Kubernetes sends a request to Open Policy Agent, which makes the authorization decision, and on the bottom, we've got the data that feeds into that decision. We collect data from Git and from Active Directory, and then we put that data into an S3 bucket for Open Policy Agent to read. Let's talk about that data first. Here's a summary of all the data that feeds into OPA in order for it to make authorization decisions. First, the access control capabilities are stored in an access-restricted Git repository. These are the roles that we can give to a user to allow them to perform certain actions. We then use Active Directory to store group membership, which tells OPA which users are able to use which capabilities. And finally, we have service metadata that contains various information about the services we run that we may want to use for authorization.
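To make the three data sources concrete, here is a rough sketch in Python of what they might look like side by side. The names and shapes here are illustrative assumptions, not Yelp's actual schema; in the real system this data is compiled into an OPA bundle rather than Python objects.

```python
# Illustrative sketch (made-up names and shapes) of the three kinds of
# data that feed the authorization decision.

# 1. Capabilities, versioned in an access-restricted Git repository.
capabilities = {
    "dev-unprivileged": [
        {"verbs": ["get", "list"]},  # read-only, on any cluster/namespace
    ],
}

# 2. Group membership, pulled from Active Directory.
group_members = {
    "opa-kubernetes-dev-unprivileged": ["alice", "bob"],
}

# 3. Service metadata: per-service facts such as the owning team.
service_metadata = {
    "my-service": {"team": "infrasec"},
}


def groups_for(user):
    """Look up which Active Directory groups a user belongs to."""
    return sorted(g for g, members in group_members.items() if user in members)
```

Group membership is the glue between the other two: it decides which users may exercise which capabilities.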
Next, let's talk about the policy. The OPA policy is the logic that OPA uses to take all of the data provided and make a decision. It is written in Rego, which is a language based on Datalog. I won't cover the details of the Rego policy, but I'll cover specific examples of what the input looks like. The capability format contains a number of different ways in which you can limit access. It currently supports clusters, namespaces, resources, subresources, resource names, verbs, pod metadata, and service metadata. These are all native Kubernetes attributes, except for the service metadata. A capability can have any number of subcapabilities; in this example, we only have one, called admin. These attributes are all structured as allow lists, where if you match any value you are allowed, which means that an empty list means allow all. Here's an example capability with two subcapabilities. The first allows you to run any command as long as the cluster is one of the two listed. And you can also run the list verb, which sometimes appears as a get in kubectl, in any cluster. Together, these two subcapabilities combine to create an unprivileged capability. In this example, we're using pod metadata to filter. yelp.com/service_name is just a custom metadata attribute that we use to represent the name of the service that a pod is part of. This will match if either of the two values matches. Here's another example. We're using a team attribute to limit actions to only those where we can match the action to a service owned by the InfraSec team. You can also use a custom variable in the capabilities to represent the user's team rather than a static value. Here's a capability to let someone run read-only commands for any service that their team owns. And here's the same capability, but without the read-only requirement. Next, I'll cover the OPA policy manager.
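As an aside, the allow-list semantics just described can be sketched in a few lines of Python. This is a stand-in for the real Rego evaluation, and the attribute names are assumptions; the key behaviors it captures are "empty list means allow all", "every attribute must pass", and "matching any one subcapability suffices".

```python
# Toy version of the capability matching semantics (illustrative only;
# the real policy is written in Rego).
ATTRS = {"clusters": "cluster", "namespaces": "namespace",
         "resources": "resource", "verbs": "verb"}


def attr_allows(allowed, value):
    """An empty allow list means 'allow all'; otherwise the value must match."""
    return not allowed or value in allowed


def subcapability_allows(subcap, request):
    """Every attribute of the request must pass its allow list."""
    return all(attr_allows(subcap.get(cap_key, []), request.get(req_key))
               for cap_key, req_key in ATTRS.items())


def capability_allows(capability, request):
    """A capability is a list of subcapabilities; matching any one suffices."""
    return any(subcapability_allows(s, request) for s in capability)


# The "unprivileged" example from the talk: any verb on two dev
# clusters (hypothetical names), plus the list verb anywhere.
unprivileged = [
    {"clusters": ["kube-dev-1", "kube-dev-2"]},
    {"verbs": ["list"]},
]
```

With this shape, `delete` works on a dev cluster, `list` works everywhere, and `delete` on a production cluster is denied.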
The OPA policy manager is a service that we use to compile all of the input data that OPA needs, bundle it up into a format OPA can read, and push it to S3. It runs continuously and only updates the bundle when anything has changed. Now I'll cover the components that run on each Kubernetes host. OPA runs as a service on each host and is configured to pull from S3 to configure itself, as well as to listen for webhook requests that come from Kubernetes. The input it gets from Kubernetes takes the form of a SubjectAccessReview, which contains all the information about the user's request. OPA then combines that with the data from the S3 bundle to make a decision. These decisions are then logged and shipped to Splunk. Now we'll go over a couple of end-to-end examples of how this can be used. First, a basic example. In the top one, a user is trying to list pods in the default namespace. The list verb matches the specified verb in the capability. The namespace and the resource don't have any restrictions in the capability, and the user is a member of the OPA Kubernetes dev unprivileged group, so the request is allowed. In the bottom example, the verb doesn't match and the request is denied. In the next two examples, we are using a team-based capability, which depends on the request matching our service metadata. The SubjectAccessReview only contains the name of the pod, so the first step is actually for OPA to request the pod metadata from Kubernetes. That pod metadata contains the name of the service, and that service can be matched against our service metadata. In this case, the service is owned by the InfraSec team, and the user making the request is also a member of that team, so the request is allowed. In the bottom example, the service name is now PE service, so the owner no longer matches the user's team and the request is denied. Note that we could still add a static value to the capability.
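For readers unfamiliar with the webhook input, here is an abbreviated sketch of a SubjectAccessReview, plus a toy decision function that stands in for the per-group capability evaluation. The decision function and the `read_only` predicate are illustrative assumptions, not the actual Rego.

```python
# Rough (abbreviated) shape of the SubjectAccessReview the
# kube-apiserver sends to an authorization webhook.
sar = {
    "apiVersion": "authorization.k8s.io/v1",
    "kind": "SubjectAccessReview",
    "spec": {
        "user": "alice",
        "groups": ["opa-kubernetes-dev-unprivileged"],
        "resourceAttributes": {
            "verb": "list",
            "resource": "pods",
            "namespace": "default",
        },
    },
}


def decide(sar, group_capabilities):
    """Allow if any group the user belongs to grants the requested action.

    group_capabilities maps a group name to a predicate over the
    resource attributes -- a stand-in for capability evaluation.
    """
    attrs = sar["spec"]["resourceAttributes"]
    return any(check(attrs)
               for group, check in group_capabilities.items()
               if group in sar["spec"]["groups"])


# A hypothetical read-only capability for the unprivileged group.
read_only = {"opa-kubernetes-dev-unprivileged":
             lambda a: a["verb"] in ("get", "list", "watch")}
```

With this, the `list pods` request above is allowed, while changing the verb to `delete` flips the decision to denied, mirroring the two end-to-end examples.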
If we added PE as a static value to the capability, the team would match again and the request would be allowed. Finally, I'm going to talk briefly about the decision logs. Each authorization request ends up with a log being recorded. This has, of course, the basic information, like whether the result was allowed or denied, as well as the various input in the SubjectAccessReview that Kubernetes sent to OPA. But what's really neat is that it shows you which groups would have been allowed versus which groups the user actually had. This makes it very easy for us to debug requests and to figure out what capabilities we should add for someone who needs to do something. Next, I'm going to hand it over to Charlie to talk about rollout strategy and system reliability. Thanks, Quentin, for the great technical deep dive. Let's talk about our rollout strategy. Our major challenge was switching to a new authorization system under heavy usage. To avoid any system or user disruption, we came up with the following approach. First, we ensured that all of the changes were rollback-safe. Then we configured our infrastructure to support a dry-run mode, and rolled out that dry-run mode incrementally, cluster by cluster. In this way, we were able to observe the actual usage patterns. Once we had collected enough data, we provisioned least-privilege authorization capabilities. Finally, we incrementally rolled out the enforcement mode that enforces least-privilege authorization. One thing that I would like to emphasize here is that at every step, we over-communicated the rollout. For infrastructure changes, we let the SRE teams know at each step. For authorization provisioning, we communicated with the stakeholders. In this way, we were able to seamlessly roll out a major authorization system change in our infrastructure. Now let's talk about several challenges that we encountered throughout this project.
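Before we get to the challenges, the dry-run mode just described can be sketched as follows. This is an illustrative sketch, not the actual implementation: the key idea is that the policy is evaluated and the would-be decision is logged either way, but nothing is denied until enforcement is switched on.

```python
# Sketch of a dry-run authorization mode: evaluate and log every
# decision, but only deny once enforcement is enabled. (Illustrative
# only; names and shapes are assumptions.)

def authorize(request, policy_allows, enforce, log):
    decision = policy_allows(request)
    # Log the would-be decision so real usage patterns can be observed
    # and least-privilege capabilities provisioned before enforcement.
    log({"request": request, "would_allow": decision, "enforced": enforce})
    return decision if enforce else True
```

Running clusters in this mode first is what made it safe to derive least-privilege capabilities from observed traffic rather than guessing them up front.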
One of the major challenges was that the SubjectAccessReview did not contain enough information for us to make least-privilege decisions. For example, it didn't have the service name that we use for team-based authorization. As a solution, for each authorization request, OPA reaches out to the kube-apiserver with the resource name and gets the resource labels, including the service name. OPA also reads environment variables from the host to make authorization decisions based on the Kubernetes cluster or Yelp ecosystem. The next problem was that unprivileged engineers with network access could curl the OPA API and modify the authorization policies. In this way, any unprivileged user could entirely bypass the authorization system, which is really bad for our security posture. As a solution, we set up mTLS between Kubernetes and OPA. Additionally, we created an RBAC policy that only allows OPA to run get commands in the Kubernetes clusters. In this way, we aim to mitigate privilege escalation attacks through OPA. The next problem, which we encountered during the design process, was that multiple teams own services in a single namespace. In fact, one of our namespaces contained hundreds of services owned by more than a hundred teams. We could have created a capability for each of these teams, but then we would have had hundreds of different capabilities. Instead, we came up with a special keyword called "my team" and created a single team-based policy. In this way, everyone who has the same team-based policy can only access their own services. One thing I would like to mention here is that this wouldn't be possible with the default Kubernetes RBAC policies, because RBAC policies do not provide such granularity. The next problem that I would like to talk about is how we associate a team with non-pod resources that lack metadata.
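Before moving on, the "my team" mechanism can be sketched like this. The keyword spelling `my_team` and the data shapes are assumptions for illustration; the point is that one shared capability resolves to each requesting user's own teams instead of needing hundreds of per-team capabilities.

```python
# Sketch of the "my team" keyword (spelling assumed): a single
# team-based capability that resolves to the requesting user's own
# teams at decision time.

def resolve_teams(capability_team, user, user_teams):
    """Expand the special keyword to the user's own teams; otherwise
    the capability names a static team."""
    if capability_team == "my_team":
        return user_teams.get(user, set())
    return {capability_team}


def team_allows(capability_team, user, user_teams, service_owner):
    """The team owning the target service must be among the resolved teams."""
    return service_owner in resolve_teams(capability_team, user, user_teams)
```

Everyone holding the same `my_team` capability is thereby scoped to their own services, which is the granularity plain Kubernetes RBAC cannot express.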
We previously talked about how we use pod metadata to associate a request with our own service metadata to make team-based access decisions, but some non-pod resources require special treatment. For example, secrets are bound to services but do not have team ownership associated with them. To enforce least-privilege team-based access to secrets, we came up with special-case handling in the Rego policy. Essentially, our policy only allows team-based access to secrets under the following conditions: the service must belong to the user's team, and all of the service's instances must also belong to the user's team. In this way, we can enforce team-based access for all types of resources. But as you can imagine, this came at a price: it made the Rego policy long and overly complicated. This brings us to our next problem, which is that our Rego policy became overly complicated and hard to test and debug. To make sure that we had enough coverage, we wrote extensive unit tests; in fact, we have more than 2,000 lines of unit test code. So those were the overall problems and challenges that we encountered throughout this project. Let's talk a little bit about system reliability. As you have seen from the new architecture, we actually added more complexity under the hood compared to the status quo. As such, it was very important for our design and our rollout strategy to ensure that the system was fault-tolerant in the face of failure scenarios. Let's go through some of these scenarios. The first one: what if we push a bad policy that has catastrophic side effects, like blocking access for all users? We do enforce a strict code review process, but nobody's perfect, and bad things might slip through. In case this happens, we have automated checks in our CI/CD pipeline that will prevent uploading a bad policy bundle to the S3 bucket.
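A guard along those lines can be sketched as a pre-publish check: run a set of smoke tests against the candidate policy, and block the bundle upload if any decision comes back unexpected. The test cases and shapes here are made up for illustration.

```python
# Sketch of a CI/CD guard: smoke-test a candidate policy before it is
# pushed to S3; any unexpected decision blocks the upload.
# (Illustrative only; test cases and shapes are assumptions.)

def safe_to_publish(policy_allows, smoke_tests):
    """Return True only if every smoke test gets the expected decision."""
    return all(policy_allows(request) == expected
               for request, expected in smoke_tests)


smoke_tests = [
    # A catastrophic policy that locks everyone out must never ship.
    ({"user": "admin", "verb": "get", "resource": "pods"}, True),
    # Nor should one that allows everything.
    ({"user": "anonymous", "verb": "delete", "resource": "secrets"}, False),
]
```

A policy that denies everything, or one that allows everything, both fail these checks and never reach the bucket.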
The next case is that even with the automated tests, it is still possible that a bad policy might slip through. This, or any other kind of issue with the Open Policy Agent service, could cause it not to respond and block people from getting authorized in the Kubernetes clusters. To mitigate that risk, we distribute administrative keys to the Kubernetes hosts and created kubectl admin wrappers that only the system administrators can use. In case of emergency, a handful of admins will still be able to access the Kubernetes clusters using these admin keys. In our architecture, we rely on many different systems, and we implemented the architecture in a fault-tolerant way. For example, if for any reason GitHub or Active Directory does not respond, we have the last known state of the world in the S3 bucket, and the system will continue to work. And if for any reason S3 stops responding, we still have the last known state of the world cached in OPA on the Kubernetes clusters, and users can still access the clusters. To conclude this talk, I would first like to talk about the shortcomings of the system. After all, nobody is perfect, right? The first shortcoming is that, unfortunately, not every resource has meaningful metadata in our infrastructure, so we cannot make a team-based access decision for every resource. Although these cases are rare, in the future we want to give ownership-based access to all resources to go all the way with least privilege. To enforce that all resources have meaningful metadata, we are planning to use an admission controller. Another shortcoming is inconsistency in the authorized actors. We use OPA for human users, which are the majority of our users, but we use RBAC for the service users. So on our roadmap, we are planning to migrate service users to OPA-based authorization. The next one is that the Okta authentication has a one-hour TTL. And to be honest, we do not consider this a shortcoming.
In fact, from a security engineering perspective, this is actually good for our security posture, but we received some complaints from users that they have to enter their password every hour. The next two shortcomings I'd like to talk about are related to keeping the system least-privilege-compliant. After we implemented the new authorization architecture, we spent a lot of time coming up with least-privilege-compliant capabilities, and we believe we did a good job there. But unfortunately, we do not have a system that constantly monitors unused permissions and drops them from users. Additionally, Yelp is an extremely dynamic environment, and people get access to new clusters all the time, and we do not have time-limited access controls: once a user gets a permission, they keep it indefinitely. Addressing these issues is already on our roadmap. To conclude this talk with some of our learnings, I would first like to say that if you are making a fundamental shift in how people interact with a platform, such as exposing more layers to people or expanding the surface area, do not just bluntly carry over the security model. It's important to re-evaluate the security model at major system changes like this. Another thing that we learned was that having a well-thought-out system design can make for a smooth review process for security teams to get sign-off from the SRE teams. In fact, during our project, we had a very smooth review process for implementing such a large project, and since its inception, the system has never malfunctioned in a way that prevented people from getting access. Next, it is totally okay to first build the tools that can support least-privilege access without actually writing the least-privilege rules, because doing both at once can be really challenging. After some time, you can do the least-privilege setup as a follow-up.
At the end of this project, we gained the ability to grant least-privilege access to Kubernetes clusters based on many parameters, like need and ownership. As a result, we now have a stronger security posture, enforcing least-privilege access for hundreds of teams to thousands of different Kubernetes resources. And this concludes our talk. I would like to thank all of you for joining us today, and we'd be happy to answer any questions. Thank you.