My name is Manish Mehta. I am a security engineer at a little-known company called Netflix. I've been there for about four years, and my projects involve secure bootstrapping, PKI, secrets management, authentication, and authorization. Authorization is what we're going to talk about today. I have a co-presenter, Torin.

Hi, everybody. My name is Torin Sandall. I'm the tech lead of the Open Policy Agent project, which we're going to talk about in this presentation. I've also contributed to Kubernetes and Istio. I love Golang and high-quality software. So take it away, Manish.

All right, let's get started. Before we start talking about the main topic here, I want to get some background definitions out of the way using an example. Let's say I'm trying to send a request to my bank that says: transfer $1,000 from account X to account Y. In this case, the bank is going to perform two steps. One, it is going to verify the identity of the requester, that's me. That is what we call AuthN, authentication. Two, it is going to verify that this identity is authorized to perform the requested operation. That's authorization, AuthZ. Now, for some of you this may be really obvious, but I cannot tell you how many times I get into conversations where people confuse these two things, and then the conversation goes nowhere. So hopefully these background definitions help. We're going to talk about bullet number two, not one. One more thing I would like to say: these two steps do not need to be tied together. They do not need to happen within one system. They could be completely decoupled. In fact, I will go one step further and say that if you tie them together, sooner or later you're going to lose your flexibility. If you have an interest in that statement, meet me afterwards and I can go into a deeper conversation there. Now, some more background about Netflix's architecture.
This is a very, very simplified, high-level view of Netflix's architecture. We have our customers. We have our backend. We have our cloud provider partners. And then, of course, the CDN that stores your movies and shows and gets the bytes to your TV as quickly as possible. We are going to focus on this big, empty box today, which is our backend that runs all our control plane. What we have there is a CI/CD pipeline, a container orchestration system, and a workflow management system, which are very similar to Kubernetes in many ways. I think this morning you probably caught Diane's keynote; she is the director who manages the team behind Spinnaker. So these are all the systems that drive and launch all the applications and workloads. Then we have the applications themselves: an API gateway, personalization, account management, key management, legal, encoding of movies, and all those things. And then we also have batch jobs, periodic or on-demand tasks, which run in containers through our container management system called Titus. We also have some internally hosted services, like storage or real-time data streaming. And then we, of course, have employees and contractors who are responsible for bringing these applications together, running them, and maintaining them. Now, this all looks simple, and maybe not too different from your setup. But then things get challenging when this happens: they want to talk to each other. Of course, there are other interactions, where applications go and talk to cloud provider resources like storage; in the case of AWS, that would be S3, or databases, or queues. But today we are going to focus only on the interactions within the control plane. All these applications, all these services, are hosted by us and controlled by us. When they want to talk to each other, you want to make sure they have an opportunity to decide who gets to talk to them, and at what level.
Of course, as I said, we're not talking about authentication. A lot of people say network reachability is all you need: if you have network reachability, that means somebody is authorized to talk to you. Not really. First of all, that's not authentication, and it's definitely not authorization. What you want to do is go to a much more granular level than just network reachability. I'll give you an example. If one of these services is a REST-based service, you want to control exactly who gets to call which REST endpoint. So let's define this kind of problem. I just gave REST as an example, but this is a very, very diverse backend where we have REST-based services and gRPC-based services, and there are some services that have their own custom binary protocol that has nothing to do with any standard. So how do you solve the AuthZ problem in a world like that, where you have such a diverse set of services, using different protocols, hosting different resources, being called by both people and services? Once you try to solve that problem, you have to first define it, and with this kind of diversity, it feels very general. So the best I could do with that kind of problem at hand was come up with this definition: we need a simple way to define and enforce rules that read something like this. Identity I can or cannot perform operation O on resource R, for all combinations of I, O, and R in your ecosystem. That sounds like boiling the ocean. However, this problem needs to be solved, because if you take any subset of this I, O, and R and build one solution for that, you're going to end up with nine solutions in your ecosystem and lose visibility and control completely. So that was not an option.
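The generic rule shape Manish describes can be sketched in code. This is a hypothetical illustration, not Netflix's actual system; the rule data, the wildcard convention, and all the names below are made up for the example.

```python
# Hypothetical sketch of the generic rule shape:
# "identity I can or cannot perform operation O on resource R".
# Rule data and the '*' wildcard convention are illustrative only.

ALLOW_RULES = [
    # (identity, operation, resource)
    ("report-generator", "GET", "/getSalary/*"),
    ("alice", "GET", "/getSalary/alice"),
]

def matches(pattern: str, value: str) -> bool:
    """Match a value against a pattern; '*' matches any suffix."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return pattern == value

def is_allowed(identity: str, operation: str, resource: str) -> bool:
    """Allow the request if any rule matches all three fields."""
    return any(
        matches(i, identity) and matches(o, operation) and matches(r, resource)
        for (i, o, r) in ALLOW_RULES
    )
```

The point of the (I, O, R) framing is that one evaluation mechanism covers every combination, instead of one bespoke system per resource type.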
We had to have one system that would cover, if not 100%, then the majority of your combinations of I, O, and R; that is, identities, operations, and resources. Now, before you start building something like this, you have to have your guiding principles and requirements in place. We wanted to make sure we wrote all these things down before we actually proposed something. So, first things first. I don't know if you caught Diane's keynote today, but she talked about how company culture impacts tech, as in the solutions you build, and how it sometimes goes the other way as well, where whatever you build also impacts the culture. In this particular case, this is an authorization system in a cloud-native environment where you want to make things self-serve, because what we have at the core of our culture is something called freedom and responsibility, where all our engineers, developers, and teams are free to do whatever is best for their own service. In this environment, when they have ownership of their own service, they are also required to define who gets to talk to their service, and at what level. So if a solution does not give them that kind of freedom, it's not gonna fly in a company like Netflix. First things first, we had to make sure the solution works with the company's culture. Second, resource types. As I mentioned, we don't have one resource type, and we didn't wanna build a solution just for REST services or gRPC services. Remember, I'm talking about random stuff here, not even just REST and gRPC as in API calls; I'm talking SSH access too. For example, if you have a VM and you need SSH access into the system, SSH becomes your resource, right? So it's not just the API call, it's SSH too. Third, identities. A lot of the authorization systems that you will see around are mostly RBAC, and they are either LDAP-based or AD-based.
The problem there is that you have to have accounts; most of those systems are designed for users. But here you have incoming identities that can be users, and users can be full-time employees or contractors. And then you can have software, which can be batch jobs, containers running services, or VMs running services. All these callers need to be identified and supported. Underlying protocols: as I said, it could be HTTP, gRPC, or completely custom binary protocols. Implementation languages: freedom and responsibility again, where people are free to use whatever language they prefer. I mean, there could be a religious war about this: Java, Scala, Node, Ruby, Python, Rust. All right, latency. I think this is one of the requirements that I really had to think through, and it actually had a big impact on the architecture that we ended up with. Think about a Kafka cluster, which has a bunch of nodes, and each node is handling 1,000 requests per second. Now, go back to your queuing theory for a little bit: if your authorization decision on every request to put to or get from a Kafka topic takes more than one millisecond, you are thrashing. You went over your service rate, right? That means your authorization decision has to be made in sub-millisecond time; otherwise, you're not even serving. In this particular case, can you even think about an authorization decision that requires a network round trip? You cannot. So some of these things had to be considered. Flexibility of rules: I think this is where Torin will say more, but you know your use cases today, and that doesn't mean you can predict everything that is gonna come next week. If your rule engine, the way you write your policies, is hard-coded and does not allow you to write in a way that feels more like a language, then you can really restrict yourself in the future.
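The Kafka latency argument is simple arithmetic, and a small sketch makes it concrete. The 1,000 requests per second figure is from the talk; the function and variable names are mine.

```python
# Back-of-the-envelope check of the latency argument: at N requests
# per second per node, a synchronous authorization check on every
# request must finish in well under 1/N seconds, or the node spends
# its entire capacity on authz decisions (utilization >= 1).

requests_per_second = 1000                  # per-node rate from the talk
budget_ms = 1000.0 / requests_per_second    # 1.0 ms budget per request

def utilization(decision_latency_ms: float) -> float:
    """Fraction of each second spent just making authz decisions."""
    return requests_per_second * (decision_latency_ms / 1000.0)
```

A 1 ms decision alone already saturates the node (utilization of 1.0), which is why any design requiring a network round trip per request is off the table, and why decisions have to be made locally, in memory.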
So we wanted to make sure that the flexibility of rules is there. And the last one I call capture of intent. What I mean by this is that when people are self-serve, they tend to make mistakes. They're not malicious; they just didn't have their coffee. They think they did something, but that's not exactly what they ended up writing in the policy. So is there any way to give them the freedom, but not enough rope to hang themselves? This is what we came up with. We'll go through it one by one, but look at service A on the bottom left and service B on the bottom right. Service A is a VM that is running its application code, and you see a little box called the AuthZ agent. On the right you have a pod which has the application code and another container in the same pod, which is the authorization agent. So let's look at this architecture one by one. Here you have the policy portal, where engineers or developer team members go and write their own policies for their own services. It's a UI-based system, and they're able to create policies, delete policies, and reorder the rules inside the policies. And sometimes we have to give override mechanisms to some critical teams, like SecOps and Forensics and so on. All the policies are versioned and stored in the database. Now, sometimes you have to write policies based on data that is not coming in from your request; it's external data. For example, let's say you had a REST service and you say /admin/* is only accessible by the owner of this app. In that particular case, you need to find out who the owner of a given app is. That mapping between app and app owner comes from some external source; in this case, it could be an application ownership database. Another example: I have this application, and this application is only meant to be used by the finance team.
Okay, who's in the finance team? That information about users and the finance team needs to come from somewhere else, probably an employee management database. So now you're writing all these policies and you need facts, a source of truth, for all this information, and it needs to come from somewhere else. Depending on how many different types of policies you write in the future, you may fetch data from multiple sources. So we have a concept called the aggregator, whose job is to fetch all this data from the different sources and keep it fresh. Then there's a concept called the distributor, which pulls all the policies and related data from the aggregator and keeps it hot. Now, the distributor is fairly scalable because it keeps everything in memory; you can slice and dice it and put it in a different, let's say, cloud provider account for security and so on. And then these distributors, as the name says, start distributing all these policies and the relevant data to the authorization agents. The authorization agents are able to asynchronously download all this information and keep it hot. So the red arrows right there are what I call the hot path, where the request comes into the application, goes to the authorization agent, and comes back with the answer. Now, I mentioned something about latency. See here that we are not making a network round trip; the authorization agent is sitting right there. In the case of a pod, they're still right next to each other, so you're not spending a real network round trip. Now, if you zoom in a little bit into the agent itself, it has two parts: the hot path and the asynchronous path. The hot path is the gray path, where the application is making a request for an authorization decision; whatever request it received, for whatever resource, it passes that information to the policy engine. You see here we are using the Open Policy Agent's engine.
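The agent's two paths can be sketched roughly like this. This is an illustrative toy, not Netflix's agent or OPA: `fetch_snapshot`, the tuple-based rule representation, and the class name are all assumptions made up for the example.

```python
# Toy authorization agent illustrating the two paths described above:
# an asynchronous path that refreshes policies/data from a distributor,
# and a hot path that answers entirely from memory.

import threading

class AuthzAgent:
    def __init__(self, fetch_snapshot):
        # fetch_snapshot() stands in for the download from a distributor;
        # it returns (rules, data), where rules is a set of
        # (identity, operation, resource) tuples.
        self._fetch = fetch_snapshot
        self._lock = threading.Lock()
        self._rules, self._data = fetch_snapshot()

    def refresh(self):
        # Asynchronous path: called periodically (e.g. from a timer
        # thread) to pull fresh policies and data and swap them in.
        rules, data = self._fetch()
        with self._lock:
            self._rules, self._data = rules, data

    def check(self, identity, operation, resource):
        # Hot path: answered from memory, no network round trip.
        with self._lock:
            return (identity, operation, resource) in self._rules
```

The design choice to separate the two paths is what keeps the decision latency independent of the distributor: a slow or unreachable distributor only delays updates, never requests.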
Torin's going to talk more about it. And then we have a slow path, or asynchronous path, which is the blue path, which is downloading all this information periodically from the distributors. Now, this is all architecture and theory, so let's take one concrete example in a familiar-looking setup. Think about a very, very simple REST-based payroll system that exposes only two REST endpoints: one, getSalary; two, updateSalary. Now you want to write an authorization policy for this particular app, and this is what you want to write: employees can read their own salaries, and the salaries of anybody who reports to them. In this case, let's say Bob reports to Alice. When Bob reports to Alice, Bob is able to get his own salary, and Alice is able to get her own salary and Bob's salary too. This is what you want to achieve. Then you want a report generator batch job. Some batch job kicks off on, I don't know, a weekly basis, and crunches some numbers. You want to give that report generator app permission to read anybody's salary, so you want something like getSalary/*. And then you have, let's say, a performance review app that kicks off yearly, or every six months, whatever your company does, and goes and updates the salary. Of course, you don't want to give employees access to post their own salary updates, so you say: all right, only that application has access to the POST API. At this point I'm going to hand it over to Torin, who will explain how all this magic happens within OPA.

Okay, thanks Manish. So Manish just gave a great overview of how Netflix is solving authorization at scale across their stack.
And what I think really resonates for me, and for a lot of us here today, is that so many organizations are trying to solve authorization and policy enforcement at scale, across all these different kinds of resource types and execution environments and languages and cloud providers and so on. What I also really like is this desire for a general-purpose solution that solves for all of these different combinations in a holistic way, across the stack. And this is what we set out to do when we created the Open Policy Agent project. So the Open Policy Agent, or OPA as we like to call it, is an open source, general-purpose policy engine. What that means is that you can take OPA and apply it to any system, at any layer of the stack. What you get when you use OPA is this purpose-built engine that you can offload policy decisions to. The way this would work is: say you're building a service that exposes an HTTP API. You would take that service and integrate it with OPA, so that it executes a query against OPA when it wants to enforce access control over who can do what via the API. In that query you would supply a bunch of input, like the method and the path and the headers and maybe the body and so on. OPA would take that input, that query, and combine it with the policies and the data, and it would evaluate all of that to produce an answer, like allow or deny, which it would then send back to your service so that it could be enforced. Now, OPA itself is implemented in Go, and it's designed to be as lightweight as possible. So you can take it and run it as a sidecar next to your application, or you can run it as a host-level daemon, or you can embed it directly into your application as a library, just like Netflix is doing. I said it's lightweight, and the reason for that is because all of the policies and data that OPA uses for evaluation are kept in memory.
So it doesn't introduce any kind of runtime dependency at deployment time; it doesn't depend on an external database or an external service or anything like that. Everything's cached in memory. Now, in addition to the core evaluation engine, OPA also provides a suite of tooling that you can use to develop your policies locally. It gives you an interactive shell to experiment with and debug policies, it gives you a test framework to codify unit tests over your policies, and so on. But the core thing that OPA gives you is this high-level, declarative policy language, and we call that language Rego. What Rego does is give you the ability to express policy as code. What that looks like when you use Rego is you write a bunch of rules in this declarative language, and the rules exist to answer questions, or make decisions, like: can user X perform operation Y on resource Z? So what we thought we would do is step through this example that Manish set up and show how you would use OPA to enforce it. The policy in English is fairly simple. It says that employees are allowed to read their own salary, and they can also read the salary of anybody who reports to them. So let's look at how we would actually use OPA to enforce this. When you're using OPA to enforce policy, what you're mainly thinking about doing is writing rules that make decisions over some data, and the language that OPA gives you to do that is purpose-built for writing policy and reasoning over arbitrary data. The reason for that is because when you're thinking about policy, what you're thinking about is data and logic, and so what you really want is a language that lets you focus on exactly that. So what we're gonna do is create a rule called allow, and that rule is gonna allow requests if the employee is trying to read their own salary.
Now, in order to make the decision of whether or not to allow the request, we're gonna need some data to make the decision over. The service is gonna provide some input, and you can see an example of that on the left: it provides the method and the path, and then the authenticated user making the request. And then we're gonna have the rule use that data to make a decision. You can read this rule as basically: allow is true if input.method matches GET, and input.path matches getSalary/{id}, and input.user matches {id}. Now, the interesting thing about this example is that the id value is actually a variable, and that variable is gonna be bound, when OPA evaluates the rule, to a single value across all of those expressions. So, for example, in the second expression in the rule, it's gonna get bound to bob from the path, and then in the third expression it acts like an equality check: it checks whether or not the input user matches bob. In this case it would, and so the request would be allowed. Okay, so now we're gonna add another rule, also called allow, to handle the second case, where someone is requesting the salary of an employee who reports to them. This rule is gonna have exactly the same structure: we're gonna match on the path and match on the method. But this time, we need to do something a little bit different. The input data to the policy engine would be exactly the same, but we're gonna make use of additional data, or context, that's held in OPA. You can see an example of the data on the left: we've got the management chain saying that Bob reports to Alice and Ken, and Alice reports to Ken. And then what we're gonna do is use that data, that context, to decide whether or not to allow the request. That's exactly what's happening in the third and fourth expressions in this rule.
The third expression looks up the management chain for a given user, and then the fourth expression searches over that management chain to see if the input user is a manager. Okay, and at this point, we've actually codified the entire policy using OPA. But there are a couple of other things that I wanna point out before I hand back to Manish. The first is that, in this case, we have this logic that determines whether or not one user is a manager of another. And while it's relatively simple, you may want to reuse this logic throughout your policies. You don't wanna duplicate it, you don't wanna repeat yourself all the time; you wanna be able to share and reuse it. To do that, OPA gives you the ability to compose policy, which means you can take logic and split it, factor it into separate rules or separate functions, and then call those rules or functions from other rules and functions. In this case, we're gonna do just that: we're gonna take the check for managers and pull it out into a separate function that'll return true if A is a manager of B, and then all you have to do is update the original rule. What I haven't shown here is that all of these policies are actually contained in packages, so they're namespaced, just like you'd be used to in a standard programming language like Go or Python. That ensures these policies don't run into naming collisions. The second thing I wanna point out is that OPA is completely resource agnostic. It's not coupled to any domain-specific model, and this is the main reason why we can say that it's general purpose. Regardless of whether you're writing policy over HTTP APIs or Kafka or SSH, it's all just data to OPA. OPA doesn't care; it's all just data.
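For readers following along without the slides, here is a rough Python rendering of the logic Torin describes. The real policy is written in Rego, so this is only an illustrative translation; the input fields and the management-chain data follow the example, but the function names are mine.

```python
# Illustrative Python translation of the payroll policy described
# above (the actual policy would be written in Rego, not Python).

DATA = {
    "management_chain": {
        "bob": ["alice", "ken"],   # bob reports to alice, then ken
        "alice": ["ken"],          # alice reports to ken
    }
}

def is_manager(a: str, b: str) -> bool:
    # The factored-out helper: true if `a` is in `b`'s management chain.
    return a in DATA["management_chain"].get(b, [])

def allow_own_salary(inp: dict) -> bool:
    # Rule 1: employees can read their own salary.
    # The {id} variable in the Rego rule corresponds to path[1] here.
    return (inp["method"] == "GET"
            and len(inp["path"]) == 2 and inp["path"][0] == "getSalary"
            and inp["user"] == inp["path"][1])

def allow_manager(inp: dict) -> bool:
    # Rule 2: managers can read the salary of anyone reporting to them.
    return (inp["method"] == "GET"
            and len(inp["path"]) == 2 and inp["path"][0] == "getSalary"
            and is_manager(inp["user"], inp["path"][1]))

def allow(inp: dict) -> bool:
    # In Rego, multiple rules named `allow` are OR-ed together;
    # this function plays that role in the sketch.
    return allow_own_salary(inp) or allow_manager(inp)
```

With the example data, Bob can read his own salary, and both Alice and Ken can read Bob's, because they appear in his management chain.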
Now, obviously, if you're thinking about enforcing access control on HTTP APIs or message brokers, performance is gonna be absolutely key, and this is something that we've designed for from the very beginning of the project. For example, if you take OPA and use it to enforce a role-based access control policy, where the policy basically has to search for bindings that match the authenticated user and then find roles that match those bindings, you see latencies of around 10 to 20 microseconds in the worst case. The really cool thing is that even as the dataset grows, the latency remains relatively stable. For example, in the second row there, the dataset that the engine actually has to search over is about six orders of magnitude larger than in the first one. So it scales very, very nicely. Okay, and while you can take OPA today and use it to enforce authorization policies in your services, you can also use it to enforce a variety of other kinds of policies throughout the stack. For example, we have integrations, and we've shown how you can use it to enforce admission control policies, workload placement policies, risk management policies, rights elevation, and more. To do that, you don't have to start from scratch, because we've got a bunch of great tutorials on the website and a number of pre-built integrations that you can use out of the box for projects like Kubernetes and Docker and Istio, and of course we've got many more coming. So I just wanna say that we're very excited about the Open Policy Agent project, because it provides this reusable building block to the community and to the ecosystem, and it helps solve fundamental security problems like authorization across the stack, because at the end of the day, we all need a way to control who can do what throughout our systems.
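The role-based access control lookup Torin benchmarks can be sketched like this. It's an illustrative toy, not OPA's evaluator: the point is only that indexing bindings by user keeps the per-request work roughly constant as the dataset grows, which is the stable-latency effect he describes.

```python
# Toy sketch of the RBAC check from the benchmark: find the bindings
# for the authenticated user, then the permissions those roles grant.
# Dict lookups keep the cost roughly independent of dataset size.
# (Illustrative only; OPA's rule indexing is more sophisticated.)

bindings = {            # user -> role names
    "bob": ["payroll-reader"],
    "alice": ["payroll-reader", "payroll-admin"],
}
roles = {               # role name -> permitted (operation, resource) pairs
    "payroll-reader": {("GET", "/getSalary")},
    "payroll-admin": {("POST", "/updateSalary")},
}

def rbac_allow(user: str, operation: str, resource: str) -> bool:
    """Allow if any of the user's roles grants the operation on the resource."""
    return any(
        (operation, resource) in roles.get(role, set())
        for role in bindings.get(user, [])
    )
```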
So before I hand back to Manish, I'd just like to point everybody at the repo. Please check it out, give us your stars, and we also have a demo booth in the vendor area. So if you're interested in this kind of thing and you wanna see a demo, please come on by and say hi. Okay, Manish, back to you.

All right, thanks, Torin. So, OPA is amazing, it has a lot of flexibility, and as you saw from some of the policy snippets, it's not that hard from a syntax perspective. However, we're talking about a company like Netflix, which has hundreds of teams, and remember, go back to the original requirement of self-serve. I have to make this system self-serve. These teams are very competent, but sometimes they forget their coffee, and I really don't want them to write any complex-looking code. So what we had to do was make their life as easy as possible when they're starting to write their policies. We ended up taking two steps. The first step was we built a UI on top of the Rego language, so the complexity of the language is hidden from them. I'll give you an example here. It's animated, and I don't know if it's very visible, but this is the UI, and underneath it converts the UI actions into OPA policy. In this particular case, all I'm doing is saying that this POST endpoint should be accessible only by the performance review application, right? And this is what I call capturing intent. Their intent is to allow just this particular application, and this hopefully very intuitive UI allows them to do that without making many mistakes. And then a second example, the getSalary endpoint: it's slightly more complex because it has more than one rule, because you have the employee, then you have the manager, and then you have the report generator application.
So in this particular case, you have three rules, and as you see, the animation is doing very similar stuff to what Torin was showing, just in UI format, right? Fortunately, in this particular case, these three rules are not overlapping, so the order of the rules won't matter. However, the way we write policies is, if you have ever had the pleasure of configuring iptables in the past, you put all the specific rules at the top and the generic rules at the bottom, so you can catch everything. So the way we have made this is that the UI allows you to arrange your rules the way you want, and they will be executed in the order they are listed. That helps you write policies the way you intend. We'll take all the questions at the end. So, one more thing we had to do. Yes, this is good, but it still doesn't answer the question: did you capture the intent? Because the intent is only with the person who's actually making these rules. They know in plain English what they wanna achieve, but they don't actually know that what they wrote is going to do what they think it does, right? So the second step we took was we built a unit-testing mechanism into this UI. What we ended up doing, and unfortunately I don't have a screenshot for it at this point, is we said: okay, you wanna write this policy? You finish writing the policy, and then you write a test for it, as in: whatever you think you did, this test should pass. You can have positive unit tests or negative unit tests, and before you actually save your policy and it gets pushed into production, it will run all the unit tests, and only when they pass will your policy be updated in production. Now, what happens is a policy is written, and six months later somebody wants to go and add one rule to it, and they've completely forgotten all the intent they had six months back.
So these unit tests will save the day, because the unit tests are saved with that version of the policy. As soon as you update the policy, all the unit tests that you had thought about will run before the policy is pushed into production. So yeah, we don't wanna be a gatekeeper, as Diane mentioned this morning, but we do want to provide the guard rails, and this built-in unit testing is the guard rail that we built on top of the UI. Just to summarize everything here: we have this very diverse backend, with all these services using different protocols, hosting all these different resources, with clients that look like people, jobs, VMs, batch jobs running in containers, whatnot. We had to first solve the authentication problem, which we did, and once we had that, we had to make sure that the authorization system is flexible and extensible. Latency was also a big deal. Torin showed some numbers from OPA's perspective, and when we did our own benchmark, basic policies could easily be evaluated in less than 0.2 milliseconds, which works for Kafka, and if it works for Kafka for me, it probably works for all the other services inside Netflix. At Netflix scale, coordinating updates is very hard, so if you had any kind of hard-coded rule mechanism instead of a language-based evaluation engine, you're gonna have a really hard time over time pushing out any sort of updates. Once you have a language-based system, it is very easy to support new kinds of use cases. And then, obviously, to be culturally successful in a company like Netflix, your solution has to go well with freedom and responsibility. So having a self-serve system with a good UI and good guardrails is what actually made this project interesting and successful.
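The positive and negative policy unit tests that act as this guard rail could look roughly like the sketch below, assuming a policy compiled down to an `allow(input)` decision function. This is hypothetical; Netflix's actual tests live in the portal and run against Rego policies before each new version is pushed.

```python
# Hypothetical sketch of the policy unit-test guard rail: run the
# saved positive/negative cases against the policy, and push the new
# version to production only if every case passes.

def allow(inp: dict) -> bool:
    # Stand-in policy under test: employees may GET only their own salary.
    return (inp["method"] == "GET"
            and len(inp["path"]) == 2 and inp["path"][0] == "getSalary"
            and inp["user"] == inp["path"][1])

def run_policy_tests(policy) -> list:
    """Run the stored test cases; return the descriptions of failures.

    An empty result means the policy version may be pushed to production."""
    cases = [
        # (description, input, expected decision)
        ("positive: employee reads own salary",
         {"method": "GET", "path": ["getSalary", "bob"], "user": "bob"}, True),
        ("negative: employee cannot update a salary",
         {"method": "POST", "path": ["updateSalary", "bob"], "user": "bob"}, False),
    ]
    return [desc for desc, inp, expected in cases if policy(inp) != expected]
```

Because the cases are versioned with the policy, an edit six months later still has to satisfy the original intent before it ships.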
So in closing, I would say the takeaway is: authorization is a fundamental security problem. It is not new to the cloud, by the way; the cloud just makes it more interesting because of the way it works. And if you're not solving this problem yet, you're gonna be there soon, right? You can't just wish this one away, because, you know, in our parents' days network security was enough, and it's definitely not enough in a cloud environment. One more time: if you are going to tackle this problem, try to see how you can have a comprehensive solution, rather than some hodgepodge of nine different authorization systems in your backend, because at the end of the day, if they don't talk to each other and you don't have a common place where you can go and get some visibility, it's gonna be really messy. And then you have open source projects like OPA that you can make use of. In fact, I came to know about OPA only earlier this year. I knew my requirements, and as soon as I saw it, I thought: this fits my requirements. Even if, let's say, a language is not Turing complete, it doesn't mean that it's not good enough. It's still a language, right? So you should go around, look for open source projects, and make sure that if one fits your requirements, you're able to get there faster. And the last thing I would say is: you don't have to build this alone. This problem is not new, so a lot of people are thinking about it. There's a very young community called Padme; they actually had a session earlier today. If you are interested, maybe you should get involved in that community, so that you can solve this with other people, and you may even end up learning something more about this problem and finding use cases you may not have thought about. So thank you so much. I think we can take a couple of questions.

So the question is: is it available for public use?
What part? Yeah, so the Open Policy Agent, OPA, that's totally open source. It's been open source since day one. It's Apache 2 licensed; you can check it out on GitHub. The UI is... so the UI is purpose-built for Netflix at this point. But I would say the UI is very, very specific to your environment anyway, and I don't think it's the biggest component of this whole project anyways. So, yes.

How do I compare this project with Istio's initiative? I will not try to compare, because I don't know a lot about Istio's initiative around authorization, but I would say one thing: remember, I have to solve this problem even for SSH, and I don't think Istio does SSH, right? We can talk later, but I mean, this project started about a year back, and I had not heard about Istio back then. But yeah, we can talk.

Yeah. So I should have mentioned that. The question is: does the distributor pick only the set of rules relevant to a given agent? From day one, we designed this system in a way that not only does it send the very specific rules, only the things that are applicable to you, but the updates are also delta updates. It's not sending everything; only the things that just changed are sent over the wire. Otherwise, this would just become a mess. You're right.

By the way, we are right here for the next 10 minutes or so, so if you have more questions, feel free to come by. Thank you so much for your time today. I hope this was helpful. Thank you.