Hi, everyone. My name is Jelvan Haudenbicke, and today I'm going to talk about authorization at Square. I'm a software engineer on Square's developer IAM team. I've been on this team for almost a year and a half now, and the entire time I've been working on authorization, which is what I'd like to talk about today. Here's a quick agenda of the topics I'm going to cover. First, I'll talk a little bit about Envoy at Square, then give you a quick overview of our previous authorization architecture and some of the problems and challenges we faced with it, and then how we're leveraging Envoy to do authorization.

Before we do that, I want to talk about authentication and authorization. Authentication is the process of verifying your identity: are you who you say you are? Is this the authentic Sally? Authorization is the process of verifying that someone has the right permissions and is allowed to do what they want to do. During this talk, we're going to cover the second bullet point, authorization. I know these two very often go together, but I would encourage people to try to think about them separately. We are definitely trying to decouple the two at Square as much as possible, because I think that gives your architecture a little bit more flexibility.

Next, I'd like to introduce you to SAFE. SAFE is our session authorization framework enforcer, one of the authorization frameworks available to service owners at Square. Then there's Envoy SAFE, our Envoy-based authorization solution and the main thing I'm going to talk about today. It takes a lot of what was available in the SAFE framework and moves it into an actual service that we leverage through our service mesh. Next, Envoy at Square: there's a great talk that two of my colleagues gave at this exact same conference two years ago.
A lot of the things they talked about then are now a reality at Square, so we're at the next level, where we can leverage our service mesh to build a lot of these new, exciting features. Some highlights to keep in mind that are important for this talk: Square has a centralized control plane, and the control plane has a pre-configured cache with sidecar configurations, also known as snapshots.

Now a quick overview of how we used to do authorization at Square. We had multiple authorization strategies, so services could leverage different libraries to do authorization. Some of these used protos. Others, such as SAFE, had ACL-like files where you could specify the different rules and authorization requirements, while a third set of services used custom code, with no additional library, written entirely in the application layer. On top of this, Square supports three major languages and several minor ones, and the same authorization solution is not available in all languages. What that means as a service owner is that if you have multiple microservices written in different languages, it's possible that you cannot leverage the same authorization solution for all of them. So in reality, it looks a little bit more like this. Even though we try to keep these libraries in sync and keep feature parity as much as possible, that is not always the case. Some features get implemented in one language and then get deprioritized; others still haven't been developed. So there is always a little bit of a difference, even between the same authorization library in different languages. On top of this, we have different permission sets for our private APIs and our public APIs, and only one of our authorization frameworks is able to authorize against both sets of permissions and map them together.
What that means is that if you're using that framework, you can use this authorization layer for both types of APIs, while if you're using a different library, you still have to implement some authorization code for private APIs. As you can imagine, even though this works, it presents multiple challenges. It's really hard to know whether all our microservices are running the latest version of our authz framework. It's also hard to roll out new features, because you have to implement them in multiple languages; to roll the same feature out to all services, you might have to implement it in several different libraries. On top of that, given the two different permission sets I just mentioned, it's complicated to reuse our public APIs internally. And then a lot of people would reach out and ask us: what is the right authorization strategy, what is the right framework to use? There was not always a clear answer to that question.

Besides all of these problems, another challenge was for our InfoSec team. It was extremely hard for them to do audits, because they would have to look at these ACL files, proto files, or even custom code to answer questions like: is this endpoint exposing PII data, and if it is, is it requiring the right permissions given the data it's exposing? That was a very hard question to answer.

So we tried to come up with a few solutions. First, we need a consistent authorization strategy. We also talked about unifying both permission sets so we could reuse our public APIs internally. To solve the InfoSec problem, we thought about a centralized source of truth that they could use to look up resources and their requirements, and see if those permissions match what is expected from a security perspective.
Another thing that came up is that we need a single authorization point. That way we can make sure everyone is using the same code to authorize and is always on the latest version available. So these were some of the solutions and motivations, and then we started thinking about how we would actually implement them. Some of the issues we could address with a centralized source of truth; others we could fix by having a single authorization point. We also wanted a deny-by-default approach, which we were still not quite sure how to achieve. Deny by default means that if a resource or an endpoint has not defined any authentication or authorization requirements, it gets denied. You have to explicitly define these requirements before your endpoint will work correctly.

At that point, we had been looking at the external authz filter available in Envoy, because we had reached a point where all these services had Envoy sidecars, so leveraging Envoy became a real option. For those of you who are not familiar with the external authz filter in Envoy, it's personally my favorite extension; it has made my life so much easier. The way it works is that it's a filter that will call an external service, send it the original request, and the external service can then decide whether that request is authorized or not. If it's authorized, it returns a 200, and Envoy moves on to the next filter and eventually reaches the application layer upstream, which now knows the request has been authorized. If the authorization service decides that the request is not authorized because it's lacking certain permissions, it can return an error such as a 403. In that case, Envoy takes that response and returns it to the client, including the response body that the authorization service returned.
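That decision flow, including the deny-by-default behavior, can be sketched as a tiny check function. This is a rough illustration, not Square's actual service: the route table, permission names, and function are all hypothetical.

```python
# Hypothetical route requirements an ext_authz-style service might consult.
ROUTE_REQUIREMENTS = {
    "/v2/payments": {"PAYMENTS_WRITE"},  # protected: caller needs this permission
    "/blog": set(),                      # explicitly unprotected: no requirements
}

def authorize(path, granted):
    """Return the HTTP status an external authz service could hand back to Envoy."""
    required = ROUTE_REQUIREMENTS.get(path)
    if required is None:
        # Deny by default: no requirements were ever configured for this route.
        return 403
    if required <= granted:
        # All required permissions granted: Envoy continues the filter chain.
        return 200
    # Missing permissions: Envoy returns this response body to the client.
    return 403
```

A request to an unconfigured path is denied even if the caller holds every permission, which is exactly the explicit-configuration pressure deny-by-default is meant to create.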
So this is what a successful request looks like: the client sends a request, which gets proxied by Envoy, which then calls the authorization service and receives a 200; Envoy then forwards the request to the application layer, which eventually returns a response to the client. As you saw earlier, we used a lot of library code at Square; we did not have an authorization service, so we had to build one to support this external authz filter. For the authorization service, we decided to use a database as the source of truth for all authentication and authorization requirements for all routes at Square, and we backed that up with an Envoy SAFE UI. This UI is what allowed service owners to configure their routes and their requirements, and essentially that's what gets enforced by Envoy. This is a quick preview of what that UI looks like.

We did go back and forth on whether we should use a UI and a database, or ACL files that could be checked into source control. The main reason we decided to go with the UI and the database is that we're still making a lot of changes; we still want to make improvements to our authorization model, and making those changes with a database is slightly easier than making them in static files. On top of that, if you want to change your authorization requirements with a database, you can do that immediately, while with ACL files it would require a redeploy of the authorization service. Since we had somewhere around 250 services that we were trying to migrate, every single time one of those services made a change, we would have had to redeploy the authorization service for that change to show up in either staging or production.

Next, we had the solution in place where Envoy was calling the authorization service for every single request. That's when we introduced the concept of protected and unprotected routes.
Unprotected routes are routes for static content, blog posts, images, things that do not need authorization. For those routes, we really don't want Envoy to call the authorization service, because that's a waste of resources for both the authorization service and the request itself. So we gave service owners the option of specifying whether their routes required authorization or authentication, and we built an integration between our authorization service and our centralized control plane, which sends the unprotected routes for each service over to the control plane. When the control plane builds a new snapshot for an Envoy sidecar, it knows which routes need the external authz filter disabled.

Once we had this in place, it was great, but now we had a migration challenge. We had the solution, but we still had 250 services that needed to be migrated. How do we get all those rules and authorization requirements into this central storage? This is when the team decided to invest some time in building migration scripts. If you remember from the earlier overview, the different libraries varied: some used ACL files, some used proto files. We built different scripts that would extract the rules from these files and call temporary endpoints in the authorization service, so we could store that data in the authorization database. This turned out to be a great solution, mainly because it was a very flexible approach: we were the decision makers, so we could allocate as many resources to this problem as we wanted. At the same time, it kept both authorization strategies in sync. This was very helpful as we were rolling out Envoy SAFE: we had the ability to disable it, knowing that there was still a backup strategy, the library, which was up to date and had all the right requirements in place.

Next, we had a set of services that had their authorization requirements built in the application layer.
Unfortunately, there was no easy way to extract that data and migrate it to the authorization service, so we had to work with service owners to have them manually migrate these routes. This is not as ideal, mainly because teams have their own deadlines and their own schedules, so we had to work with that. And even though teams were very supportive, there's still no automated way to keep both strategies in sync. So we have to ask teams: hey, you have to update your permission requirements in both places until we're fully rolled out and you can actually deprecate the code in your application layer.

Next, I'd like to talk a little bit more about our rollout strategy. At this point, we had a solution in place, we had a lot of data, and we were ready to try this and roll it out for multiple services. First, we introduced a logging-only mode. In logging-only mode, our authorization service always returns a 200, no matter what the actual authorization decision was, so Envoy never short-circuits the request. We did this with additional metrics and logs, so service owners could compare the decision the authorization service would have made versus what the existing library or their existing application layer actually did. That allowed them to tweak the requirements, permissions, or configurations that they had in place through the Envoy SAFE UI.

Our next rollout strategy was to use the runtime fraction configuration. This allowed us to split some of the traffic and roll out on a percentage-based approach. What we did is roll out Envoy SAFE for 5% of a given service's traffic. That allowed us to make sure that the authorization service was hitting the right SLAs and could handle the QPS, while also making sure we were not blocking any traffic we shouldn't be blocking, and giving us a little bit more of a controlled rollout.
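The percentage-based split can be illustrated with deterministic bucketing. This is only a sketch of the idea, not Envoy's actual runtime fraction implementation, and the choice of hash and key is hypothetical.

```python
import zlib

def in_rollout(key, percent):
    """Deterministically place a request key into one of 100 buckets and
    enroll it if the bucket falls under the rollout percentage. Using a
    stable hash keeps the decision consistent for the same key."""
    bucket = zlib.crc32(key.encode("utf-8")) % 100
    return bucket < percent

# percent=5 would enforce Envoy SAFE for roughly 5% of traffic;
# percent=0 disables enforcement entirely and percent=100 enrolls everything.
```

The useful property is that raising the percentage only ever adds keys to the enrolled set, so a gradual 5% → 25% → 100% ramp never flaps individual requests in and out.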
In order to support this percentage rollout, we introduced it as part of an admin panel and passed it to the Envoy control plane through the same integration I mentioned earlier for unprotected routes. That did require some changes in our data model on the control plane side to support these different values, but it worked out.

Next, I want to talk a little bit about some of the lessons learned. First, in our UI, we allowed users or service owners to use wildcards for given namespaces, mark them as unprotected or protected routes, and group certain permissions and endpoints. This introduced conflicts. As you can see, someone would mark a wildcard, all traffic, as unprotected, and later on add a more granular route with actual authorization permission requirements. We had to build some logic around that to detect these cases and either notify the end user through the UI, saying, hey, you're introducing a conflict, you might want to consider specifying more granular routes; or be very clever about how we organize these routes when we send them as unprotected routes to the control plane. It's definitely something to keep in mind, because we missed it initially.

Debugging also becomes a little bit more challenging, because now service owners have to rely on the logs we emit in the service mesh or in the authorization service. They cannot add any custom logging or custom metrics; they have to rely on a more generic output.

Next, some shortcomings we noticed with the external authz extension. As I mentioned earlier, it's my favorite filter, and I'm not the only one who thinks that: there are a lot of teams at Square trying to use this filter, not only for authorization but for other use cases as well. This filter is very versatile, so it can be used for multiple use cases and solve different problems.
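Coming back to the wildcard lesson for a moment, that conflict detection can be done mechanically. A rough sketch, with a made-up rule format (trailing `*` wildcard, boolean protected flag), not the actual Envoy SAFE data model:

```python
def find_conflicts(rules):
    """rules maps a path pattern (optionally ending in '*') to True when
    the route is protected. Return (wildcard, route) pairs where an
    unprotected wildcard covers a more specific protected route."""
    conflicts = []
    for pattern, protected in rules.items():
        if protected or not pattern.endswith("*"):
            continue  # only unprotected wildcards can shadow requirements
        prefix = pattern[:-1]
        for other, other_protected in rules.items():
            if other != pattern and other_protected and other.startswith(prefix):
                conflicts.append((pattern, other))
    return conflicts

rules = {
    "/static/*": False,    # static content, genuinely unprotected
    "/v2/*": False,        # broad wildcard marked unprotected...
    "/v2/payments": True,  # ...but this route requires permissions
}
# find_conflicts(rules) flags ("/v2/*", "/v2/payments") so a UI could warn.
```

Each flagged pair is exactly the case described above: a wildcard that would silently disable enforcement for a route someone else intended to protect.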
So as far as I know, there's no other filter that allows you to call a service and mutate the headers of the original request. But there is no out-of-the-box way to differentiate two external authz filters and deploy different configurations to them, so there's also no way to enable or disable them individually. That caused some conflicts between different teams trying to use this filter: we cannot use it for the same services, because if they disable it, they're disabling our solution, and if we enable it, we're also enabling the filter for their solution. There's also no way to bypass the filter for a given header, which would have been useful in some cases to make sure we do not re-authorize or re-authenticate a request twice. And you can't change the gRPC method the filter calls, which limits you if you want a microservice that implements two different endpoints that could be called by the external authz filter; that is not a possibility.

So, to conclude: we decided to move all our authorization from app and library code into the service mesh, with a centralized source of truth. That's where we are today. We've started rolling this out in production; we're targeting close to 250 services, and we're expecting to handle somewhere around 20K QPS. Our next step is to focus a little bit more on decoupling some of our authentication and authorization strategies. That will also allow us to implement a new, more flexible permission system, which will let us implement even more features on the application side.

That's all I had today. Thanks, everyone, for listening. This is my email if you want to reach out; I would love to hear how your team is solving authorization. And if you're interested in working on some of these problems, Square is hiring. Thanks, everyone.

Hey everyone, thanks for listening to the talk. Let me know if you have any questions.
How long are you projecting this rollout to take for, say, 90% of all services?

It took us about three to four months to get most of these routes into our database, have service owners review and update them, and start rolling out in staging. We're currently working on our production rollout, which we expect to take somewhere between four and five months, probably a little bit longer.

What was the overhead like from running these authz calls not only out of process, but over the network?

That's a good question. Some of our requests do have additional latency, but depending on the call, I think our average is still below 10 milliseconds, which is pretty good. For a while we were making two calls, one to the authentication service and one to the authorization service, and that was almost doubling the additional latency.

Is mTLS part of any of your architecture with authz calls?

Not for these authz calls. We're definitely using mTLS for some other authorization strategies that we have in place, but not for this particular design.

When DB updates are done through the UI, how do you push the changes to the authz service instances?

The way that works is through the integration I mentioned earlier with the control plane. When somebody updates routes or permission requirements through the UI, that gets synced to the database, and our authorization service and the control plane are constantly syncing. So when an Envoy sidecar hits our centralized control plane, it will get an updated configuration. Right now there is a max latency of about five minutes.

How do you deal with the situation where SAFE and other teams use two separate external authz filters in the same pipeline?

That is a problem we haven't solved yet. It's something we're working on. We are considering making changes, maybe to the external authz filter itself.
That way we can identify these filters and enable or disable them individually, depending on the configuration, or give them some sort of identifier. Right now we have not solved that problem yet.

Are your authz decisions always path-based, or is there any application business logic?

So far it's all path-based, and the main reason is that we try to keep as much application and business logic out of the authorization service as possible. That's also why we're trying to decouple our authentication service from our authorization service, to keep some of that logic separate. We are working on a different authorization model where you can enforce more business logic and specify more granular rules. That's still under development, but we're hoping once we have it to have a little bit more flexibility, so you can enforce certain parameters, because right now that still lives in the actual application layer. So it's still possible that a service will get an authorized request and need to do some additional authorization logic, especially if that logic is very close to the data model in that service.

Did you look at OPA for authz? If so, why did you choose a DB model?

We did not look into that too much. We chose the DB model mainly for the flexibility I mentioned during the talk; that was the main reason we decided to move forward with a database. We also wanted the UI integration, which was one of the features we were looking to implement.

Did you cache the results for better throughput?

We have not. It's something we've considered, an optimization we might implement later, especially since, as I mentioned, there is no way to actually skip the external authz filter, which is possibly a good thing. But yes, we might consider caching some of the authz results.

Does each authz check hit the database, or is there a cache?

We have a cache that has most of these paths and rules already preloaded.

Is SAFE authz done by SAFE?
I'm not sure what that question means, so: we had a SAFE framework that was doing authz, and basically we took a lot of the way that authorization was being enforced by that framework and moved it to a service. It's essentially a re-implementation of some of those authorization rules. So SAFE and Envoy SAFE are different implementations, but they use the same concepts, the same permission sets, and some of the mappings between those permissions that we had for private APIs and public APIs. There are some shared concepts between the two, but the authorization service has its own implementation and does not reuse the existing SAFE framework.

I think I answered all the questions. If I missed any of yours, feel free to reach out or post it again and I'll try to answer it. I left my email in the slides, so feel free to reach out; happy to chat a little bit more. Awesome, thanks everyone.