My name is Kevin Louie, and this is Callin Rain, and today we're jointly presenting on how we scale backend authentication at Facebook. Since Facebook's security architecture involves a whole lot of complex components, this talk unfortunately isn't going to be able to cover every detail of the entire system. Instead, we're going to focus on highlighting the most interesting aspects of our security infrastructure and the tools we helped build to scale authentication. So last month, we announced that our fleet consists of 11 data center locations across the US and Europe. Our data centers work together to host 2 billion monthly active users, and as a result, our internal infrastructure is incredibly intricate, with layers upon layers of caching, replication, and cross-service coordination during each user interaction with Facebook. So keeping this entire infrastructure secure can be pretty challenging on multiple levels, and at the very least, we certainly don't want any external actors to have unimpeded access to our infrastructure. But even though we pour a lot of resources into enforcing a strong network security perimeter, it's also important for us to reduce the surface of trust within that perimeter. Primarily, we do this by enforcing granular access control between services. To help understand the structure of these access controls, we can classify our internal services into roughly three categories based on their proximity to our root of trust. So our root of trust includes our core security services, with which we bootstrap security for the rest of the network. This is essentially our walled garden, and it includes, for example, the only machines which have access to Facebook's master keys. The intermediate layer of trust includes machines which are allocated for some of our critical backend services.
An example of something in this layer would be machines which hold a service-specific encryption key, which is used by the service to access its own data. And finally, the lowest layer is everything else, including the machines that directly serve web requests to load Facebook.com. In order for these three tiers of trust to be meaningful, it's important to keep the walled garden as small as possible. Having fewer machines in the top trust tier means that we can limit the number of hosts which can access master keys, which in turn means that we can implement solutions which increase the physical security of these machines without having to worry too much about scaling issues. We can also carefully audit each and every attempt to manually access these machines. Now, if the top tier of trust had to scale with the total number of Facebook servers, then all of the extra preventative measures we apply to this tier would be a lot less effective. How to keep this tier as small as possible while transitively extending its security to the rest of the machines in our fleet is the motivation behind many of our efforts in securing our infrastructure. So before we dive into how we address this central problem, we'll start by introducing some of the fundamental building blocks of authentication and authorization that we rely on at Facebook. These four components constitute the root of trust within Facebook. First, we have a key server which has access to all of our master keys used for cryptographic operations. When a client wants to encrypt or sign something, it can send the plaintext over to this key server, which then returns the encrypted result or the signature. This means that we don't have to distribute these master keys to all of the clients which need to use them.
We have a root certificate authority which is responsible for issuing and signing certificates, which associate identities with machines, and a login service which is responsible for issuing and signing user sessions, which associate identities with external users. And finally, we have an authorization service which controls all of the permissions for which entities can access resources within our infrastructure. The central components of authentication and authorization are identities and ACLs, which stands for access control lists. We define identities in a flexible and granular way that allows us to associate a unique identity with every user, every machine, every service, every job, and so on. These identities can then be listed in an ACL, which is just a publicly verifiable mapping from a specific resource name to a list of identities which can access that resource or perform that action. Okay. So with these pieces in place, we can establish securely authenticated connections between hosts in our infrastructure. In this diagram, we have a service owner who is going to start a job; we're going to call that job the client, in green. It's going to need to communicate securely with a remote server. So the first thing that happens here is that the service owner requests a certificate from the root CA on behalf of the client. The root CA has a list of the ACLs for all the possible clients, and it's going to check the service owner's identity against the ACL for that client's identity, to make sure they actually have permission to deploy a job with that identity. Note that the ACLs here are all coming from the authorization service, which distributes them. The certificate that the client receives will have the client's name in an extension field. We use an extension field instead of the common name because we want to put a lot of identities in there, and there's actually a limit to the size of the common name we can use.
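To make the identity-and-ACL model concrete, here's a minimal sketch in Python. The resource and identity names are invented for illustration, and the real authorization service is of course far richer than a dictionary lookup:

```python
# Hypothetical sketch of the ACL model described above: an ACL maps a
# resource name to the set of identities allowed to access it.  All
# names (resources, identities) are invented for illustration.
ACLS = {
    # resource name          -> identities allowed to access it
    "service/photos-backend": {"svc:photos-web", "svc:photos-worker"},
    "deploy/photos-web":      {"user:alice", "team:photos-oncall"},
}

def is_authorized(acl_name: str, identities: set) -> bool:
    """Allow the request if any of the caller's identities is listed."""
    allowed = ACLS.get(acl_name, set())
    return bool(allowed & identities)
```

In this sketch, the root CA's check from the diagram is the same primitive: before issuing a certificate for the client identity, it would call something like `is_authorized("deploy/photos-web", {"user:alice"})`.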
So once the certificate has been deployed to the job, it can be used for TLS connections that the client negotiates with the service. Let's look at the server side. When the service receives a request, it's going to parse the client's certificate and extract identities from it. It will then have an ACL that it has loaded from the authorization service, and it's going to check the client's identities against that list and allow or deny access. So here's a slightly simplified version of the diagram from the last slide. This shows authentication and authorization with one client and one server. If the authorization model in your infrastructure is this simple, then TLS is the state of the art: we get integrity, we get privacy, and we have an authenticated connection. This is actually not far from what would happen for direct, isolated connections between clients and servers in our infrastructure. However, as you all know, production environments are not this simple. In reality, a large production network is complex and doesn't fit the simple model of point-to-point authentication and authorization we just showed. So there are two key differences I'd like to highlight between that model and what actually happens. First, we have a complex service architecture. We have requests that can travel through layers of hosts before reaching a destination that will perform authorization based on the original client. At first, it seems like we could just perform authorization on each hop of that request chain, viewing each one as a separate point-to-point connection that we could authenticate and authorize. However, Kevin spoke earlier of the different levels of trust within our infrastructure. What happens when a client needs to communicate with a service through a less trusted intermediary? So let's look at that. For example, here we have a single intermediate proxy transmitting requests from a client to a destination server.
The client and the proxy have distinct identities; these were deployed with certificates. So the client's original identity gets lost when the proxy terminates TLS on the first hop, and the request would then fail the ACL check at the server. If the proxy is only used for one service, this isn't an issue: we just treat the proxy as logically equivalent to the service. To do this, we would put the proxy on the ACL, and then both the proxy and the service would check the same ACL. But if we don't want to consider the proxy as trusted as the service or the client, we wouldn't want to give it unrestricted access to the service like this. This also becomes more difficult if the proxy is shared, or if there are multiple proxies, which could look something like this. So here we have the proxy, and it's now responsible for an ACL for every service accessible through it, and this increases the complexity of the system. Sharing responsibility for authorization like this increases the operational overhead involved in managing authorization over time, and it's liable to be misconfigured. So to summarize: relying solely on TLS authentication and authorization can be less practical when we have intermediate servers in our request flows. We've found that attaching cryptographic tokens to requests can be a useful companion to TLS-based authentication for a few reasons. First, they're fairly flexible. We can put anything we want in tokens, and we aren't constrained by specific protocols or data formats. We can represent identities for users, services, and machines at Facebook, not just the identities that we issue certificates for. These tokens are also quite portable: they can be easily forwarded through proxies or intermediate servers. They're generally transport agnostic, and are just simple serialized structures we send over headers. And finally, tokens allow us to scope the rights of the identity they carry.
Each token can be limited to a narrow use case, ideally to a specific request. This limits the impact of a compromised token on the overall security of the system. This kind of scoping isn't as easily done with TLS authentication. So for the remainder of the talk, we're going to present two types of tokens we created for authentication. The first is a public key, certificate-based token, and the second is a symmetric key variant called crypto auth tokens, or CATs for short. This leads to a lot of cat puns. So certificate-based tokens can be seen as a simple extension of TLS authentication. Essentially, the idea is that we use the existing credentials the client already has, the certificate and private key we talked about previously, and we make a token out of them. This token is serialized and attached to the headers of each request. When the token is received by the server, it is validated and used for authorization, and you can see how this would solve the proxy problem we presented before. So here's a closer look at how the tokens are constructed and validated. To create one, the client creates a structure with its certificate and metadata describing the use of the token. This metadata will contain the intended proxy to use; it will contain the resource it is used for on the destination side, and any actions to be performed on that resource. This resource is generally mapped to a specific ACL within our authorization infrastructure. We also add some metadata I haven't shown here, things like timeouts and other additional application-specific data. We then serialize this, and we sign it with the client's private key corresponding to the certificate. The signature is added, and everything is serialized into the final token.
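As a rough sketch of this construction (not Facebook's actual wire format), the token is just the serialized structure plus a signature over it. All field names here are hypothetical, and the public-key signature is stood in by an HMAC so the example stays self-contained; in the real system it would be a signature made with the private key matching the client's certificate:

```python
import hashlib
import hmac
import json
import time

def make_token(cert_pem: bytes, sign_fn, proxy: str, resource: str,
               action: str, ttl: int = 300) -> bytes:
    """Build a scoped token: the client's certificate plus metadata naming
    the intended proxy, the target resource (its ACL), and the action,
    all signed by sign_fn (the client's private key in the real system)."""
    payload = json.dumps({
        "cert": cert_pem.decode(),
        "proxy": proxy,
        "resource": resource,
        "action": action,
        "expires": int(time.time()) + ttl,   # timeout metadata
    }, sort_keys=True).encode()
    sig = sign_fn(payload)
    # Final token = payload plus signature, serialized together.
    return json.dumps({"payload": payload.decode(), "sig": sig.hex()}).encode()

# HMAC stand-in for the public-key signature, purely for the demo.
demo_sign = lambda data: hmac.new(b"demo-private-key", data, hashlib.sha256).digest()

token = make_token(b"-----BEGIN CERTIFICATE-----...", demo_sign,
                   proxy="svc:edge-proxy",
                   resource="service/photos-backend",
                   action="read")
```

The scoping fields are the important part: a stolen token is only usable through the named proxy, against the named resource, for the named action, and only until it expires.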
On the server side, once we receive a token, to validate it we first make sure that the signature matches the certificate inside, and then we validate the certificate to make sure it was actually issued by our root CA. We also need to validate the metadata, which involves checking that the service relaying the token has a valid identity matching the proxy named in the token. We need to make sure that the token is actually applicable to the resource that is using it on the server side, and that the scoped action or actions in the token match the actual request. If everything checks out, we use the certificate as if it were coming directly from the client: we extract the identities and check them against the service's ACL. So we implemented and deployed these certificate tokens, and as you may expect, we quickly found that creating and validating them on every single request becomes completely infeasible for any service with a reasonably high request rate, because these are all public key operations, and they're very CPU intensive. What we can do here is add an LRU cache, one for creation at the client and one for validation at the server. The creation cache is relatively simple: it only holds the small set of tokens for that client to use with other servers. For creation, we have to generate a single public key signature, which must then be validated by the server along with the certificate chain. Each client must generate a unique token for each proxy, resource, and action combination that it communicates with in some request flow, and that token can then be used for any destination server involved in that flow. So this means that the creation cache can stay quite small; for most clients, this doesn't blow up very quickly. The validation cache is a little more complex.
The client can reuse a token for two machines running the same service, but if we have two client machines making requests to a single server machine, even through the same flow, the server will see distinct tokens, because the client certificates are distinct. As a result, the size of the server-side validation cache will scale with the number of unique clients that are sending requests to it, and this can get quite large. Because of this, we can't actually cache the entire token; that ends up being a lot of memory. Instead, we just extract the metadata and identities from the token, get rid of the certificate, get rid of the signature, and put only that in the cache. Without the certificate and signature in there, the rest of the data in the token can be efficiently cached without much memory blowup. Now you might ask: we got rid of the certificate and the signature, so how do we make sure this is equivalent? Yeah, we have to make sure that a cached validation is logically equivalent to an uncached one, so that we don't introduce any security vulnerabilities here. So we've been using these tokens for long enough to have a pretty good sense of their advantages and limitations. From an engineering point of view, these tokens are extremely scalable and reliable. They have no external dependencies at runtime, other than the client credentials that are already present on the host. As a result, we can attach them to tens of millions of requests within our infrastructure, extremely reliably. They're also quite simple: they piggyback on our existing security infrastructure, the PKI we have for distributing identities. This means we have less code to write and maintain, and fewer dependencies that can fail. Lastly, the scoping scheme we use for these tokens is generic and pretty reusable. What I'm talking about here is the proxy, resource, action model.
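A minimal sketch of such a validation cache, keyed by a digest of the full token bytes so that a cache hit is logically equivalent to a full validation of the exact same token. Class and field names are invented for illustration:

```python
import hashlib
from collections import OrderedDict

class ValidationCache:
    """LRU cache for token validation.  On a hit we skip the expensive
    public-key checks and reuse the previously extracted identities and
    metadata; the certificate and signature themselves are NOT stored,
    which keeps each entry small."""

    def __init__(self, capacity: int = 10000):
        self.capacity = capacity
        self.entries = OrderedDict()  # token digest -> extracted data

    @staticmethod
    def key(token_bytes: bytes) -> bytes:
        # Keying on a digest of the full token means a hit can only
        # occur for a byte-identical, previously validated token.
        return hashlib.sha256(token_bytes).digest()

    def get(self, token_bytes: bytes):
        entry = self.entries.get(self.key(token_bytes))
        if entry is not None:
            self.entries.move_to_end(self.key(token_bytes))  # most recently used
        return entry

    def put(self, token_bytes: bytes, identities, metadata):
        k = self.key(token_bytes)
        self.entries[k] = {"identities": identities, "metadata": metadata}
        self.entries.move_to_end(k)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

Since entries hold only identities and metadata, memory grows with the number of distinct clients but not with the size of their certificates.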
Because of this, we've integrated them into Facebook's routing framework, so they're automatically added to any requests that need them. They also have some drawbacks. First, the tokens are quite large: they contain the entire client certificate, plus an additional public key signature. And with some protocols, these headers are actually re-sent on every single request, so this is extra bandwidth that adds up. The fact that they're public key based makes them CPU-intensive to create and validate. We can mitigate this somewhat with caching, to make them feasible for services with high request rates, but that's not always enough. Lastly, the tokens are tightly coupled to X.509. This limits them to identities we can put in a certificate, which is not optimal; it's not all the identities we want. There's also a lot of data being transferred in a certificate that isn't technically necessary for authentication, but we have to keep it there because we can't tamper with the certificate. So to address these limitations, we also have an alternate symmetric key token that we use in conjunction with these certificate-based tokens. The way these symmetric key tokens work is analogous to the way Kerberos handles one-way authentication, in that the client asks the key server for a session key, or in Kerberos language a ticket, which is tied to the service that the client wants to communicate with. However, our setting is slightly different, because we're building authentication on top of the already existing TLS used to secure point-to-point connections, which means that we don't actually need to support every part of the Kerberos protocol. Our setting also differs from Kerberos in how we construct session keys. The session key is a shared secret between the client and the service, which can be used to MAC the request, and this MAC is stored in our token along with some extra information about the client.
Internally, we refer to these tokens as CATs, which stands for crypto auth tokens, and they can be seen as the symmetric key variant of certificate-based tokens. OK. Since these tokens are not certificate based, they are both flexible enough and fast enough to handle authentication for 2 billion users. In our CAT protocol, when the client makes a request on behalf of an acting user, it must first ensure that the request has been authenticated through our login service. The service it's connecting to must obtain a service key from the key server. The service key is deterministically derived from a root master key and the name of the service, along with some auxiliary information like timestamps and other properties of the service. We can think of the service key as the output of a PRF, or pseudorandom function, using the master key and the service information as its input. Then the client requests a session key by interacting with the key server, specifying the name of the service it's connecting to. The key server, which verifies the authenticity of the client, creates the session key deterministically by re-deriving the service key and then performing a second PRF evaluation, using the service key with the client information as the input to the PRF. And this is where we diverge slightly from the Kerberos authentication protocol. In Kerberos, this session key would instead be picked completely at random, and the key server would encrypt the random session key under the service key and send the resulting blob to the client; it would then be the responsibility of the client to forward the encrypted session key to the destination service. The advantage of deriving session keys deterministically is that the mapping from a client and service pair to its session key does not need to be stored explicitly, and so we bypass the inherent reliability and scaling issues that come with having to query a data store on the critical path of every client request.
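The two-level derivation just described can be sketched as follows. The talk doesn't name the PRF, so instantiating it with HMAC-SHA256 is an assumption here, as are the function names and the exact input encodings:

```python
import hashlib
import hmac

def prf(key: bytes, data: bytes) -> bytes:
    # PRF instantiated with HMAC-SHA256 (our assumption; the talk
    # only says "a PRF").
    return hmac.new(key, data, hashlib.sha256).digest()

def service_key(master_key: bytes, service: str, epoch: int) -> bytes:
    # Deterministically derived from the master key, the service name,
    # and a timestamp epoch, so keys rotate automatically over time.
    return prf(master_key, f"{service}|{epoch}".encode())

def session_key(master_key: bytes, service: str, client: str, epoch: int) -> bytes:
    # Second PRF evaluation: the service key keys the PRF, with the
    # client information as its input.
    return prf(service_key(master_key, service, epoch), client.encode())
```

Because both levels are deterministic, the key server never needs to store a (client, service) to session-key mapping, and the service can re-derive any client's session key from its own service key alone.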
In our model, the client can now use the deterministically derived session key to create a CAT, which contains the MAC over its request to the service, along with the client information. Now the service can use its own service key, along with the client information, to locally re-derive the session key, without having to request anything from the key server. This session key can then be used to verify the MAC contained in the CAT over the client's request. And finally, to handle rotations for the session and service keys, we encode timestamp information as part of the PRF input, so that these keys are enforced to only last for a certain time period before needing to be refreshed. The advantage of using CATs is that we move away from public key primitives, and hence these tokens are more performant and can scale without caches. The single MAC is also a lot shorter than having a certificate and signature embedded in each token. And finally, these tokens don't rely on X.509 certificates, so we can move away from identities that are tied to physical hosts and can more readily handle implicit identities associated with the users who perform requests. So to summarize, there are three main takeaways from this presentation. Firstly, we have different tiers of trust for our machines, and we constantly aim to keep our root of trust as small as possible. As a result, we can't solely rely on the authentication and authorization guarantees that we get from securing point-to-point connections. To tackle that problem, we implemented two solutions which allow us to extend authentication beyond the connection, each with its own trade-offs. One variant is closely tied to certificates and therefore is based on public keys, whereas the other is based on symmetric keys. Both of these token primitives have been tested thoroughly within our infrastructure and are being used to authenticate millions of requests each second.
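The create-and-verify flow for CATs can be sketched like this, again with HMAC-SHA256 standing in for the MAC and PRF (an assumption, as are the names). The key point is that `verify_cat` needs only the service key and the client info carried inside the token:

```python
import hashlib
import hmac

def make_cat(session_key: bytes, client_info: bytes, request: bytes) -> dict:
    """Client side: MAC the request under the session key.  The CAT
    carries the MAC plus the client info, so that the server can
    re-derive the same session key."""
    return {"client": client_info,
            "mac": hmac.new(session_key, request, hashlib.sha256).digest()}

def verify_cat(service_key: bytes, cat: dict, request: bytes) -> bool:
    """Server side: locally re-derive the session key from the service
    key and the client info in the token, then check the MAC.  No call
    to the key server is needed on this path."""
    session_key = hmac.new(service_key, cat["client"], hashlib.sha256).digest()
    expected = hmac.new(session_key, request, hashlib.sha256).digest()
    return hmac.compare_digest(expected, cat["mac"])
```

Note the constant-time comparison on the server side; comparing MACs with `==` could leak timing information.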
So before we finish, we just want to acknowledge all the engineers we work with who have contributed to the development of the systems we talked about, including these tokens and all the different services we mentioned. The techniques and motivations we presented today are not at all new within Facebook; there's a lot of history behind how we actually arrived at this current state, and many, many people contributed to the refinement, testing, and development of these systems over many years. We also fully expect that the technologies we presented today are not the end of the story, and there's still room for improvement in our approach to backend authentication and authorization. So we're looking forward to input from the academic audience as well. Thank you for listening. [Host:] Thank you folks for a fantastic talk. We have about 30 seconds for last minute questions. We can get one in, sure. [Audience:] If the proxy replays the tokens and pretends to be the client, why isn't this a problem? Yeah, that's a great point. We have to partially trust the proxy here with the certificate-based tokens. If you have completely malicious code in the proxy, it can kind of do whatever it wants. We see this as slightly better than just giving it the client's identity: it can't forge requests, but yes, we are open to those replay attacks. [Audience:] Sorry, one question. Have you done any presentations on the service identity part of your system, how you give services identities and manage them? We haven't presented on that yet. I think there actually might have been a talk from our transport security team at a Facebook-hosted @Scale conference you could look up, but I'm not totally sure about that. But we would love to do more of these in general. Yeah, we'll be back. Thank you folks very much.
I think we need to let everybody go. So let's thank the speakers one more time. Thank you.