Hello everyone, my name is Ryan Turner and I'm a software engineer at Uber, and today I'm going to talk to you about something called production workload identity, explain what that means and how we can use it to improve security posture, and how we can achieve that with something called SPIRE. So to start, I want to talk about a fictional organization which I've invented for the purposes of this talk. It's called Starboard Games, and they want to get into the business of selling board games as an online retailer. So let's take a look at how they might initially build this system. First they come up with a three-tier architecture to represent the different layers of the application. It's composed of three layers. One is the storage layer, a persistent stateful layer that contains the data represented in the system. Then there's an application layer which interacts with that stateful storage and creates, reads, updates, and deletes that data. And then there's a web presentation layer which is accessible to users; that's how they interact with the platform, and that web layer talks to the app layer to do different operations in the system. In this example this is all running in an AWS VPC. And as an online retailer, this company wants to provide some level of security so that not everything in their production deployment is accessible over the internet. They do this naturally through VPC ACLs, and they basically set up three ACL policies. One says that only traffic from the internet can access the web subnet, only traffic from the web subnet can talk to the app subnet, and only traffic from the app subnet can talk to the storage subnet. So pretty simple. But over time they grow, and they decide they want to decompose some of the functionality in this app subnet into several different microservices. So they move more towards a microservice architecture.
And suddenly their deployment has gotten much more complicated. You'll see that in the middle and bottom layers, where there was only one subnet, that has now tripled: the service in the app subnet has been decomposed into three different services. There's an account service, an order service which talks to the account service to get information about an account, and then a recommendation service which talks to both the account service and the order service to provide customized recommendations to users. And each of these services has its own database. So all of a sudden this middle and bottom layer has gotten much more complex. Additionally, the operations team has come up with a requirement to be able to access the instances running in production through a jump box. So there's a new admin jump box subnet, which is a much more privileged subnet and can talk to things in the web subnet as well as things in the app subnets. So quickly, the organization realizes this is something that's difficult to manage. I want to talk about what this organization's long-term goals might be. In this hypothetical example, let's say they have some growth plans to expand the retail business from North America to Europe and Asia, so to have more of a global footprint. They also want to provide an online board game experience, where users can subscribe to their platform and play multiplayer games online together. And they also want to host board game tournaments at physical locations, where users can register for these tournaments and meet up in person. Those tournaments are sponsored by a board game company or some other companies, which provides this Starboard Games company some revenue. And what are the technical objectives this organization might face? Well, in the previous architectures we've seen that each instance of a service runs in its own virtual machine in AWS, and we know that this is not the most efficient way to run services today.
There are many operating system services that are being duplicated for every instance of a service. So this organization wants to adopt containers and run their services as containers, so that they can have a more general pool of compute resources and run instances of their services across that pool. They also want to streamline their deployment and most effectively use those compute resources using Kubernetes. Additionally, they have a requirement to use a native GCP service for the online board game experience that they're trying to build; today they only have a deployment in AWS, so this is a new set of challenges they need to overcome. So with all of these growth plans and technical objectives, there are some new security challenges we need to consider. The business now has a couple of new priorities based on these growth plans. One is that they need to be more active in preventing order and subscription fraud: now that they're a global company, they're a much bigger target for online fraud, and that has a big impact on their overall bottom line. They also want to protect users' privacy and confidentiality. This is important because they are now introducing physical meetups for the board game tournaments, so something about a user's location at a certain time could be exposed if an unauthorized or malicious actor was able to obtain that data. So this is sensitive data that we want to protect. So what are some challenges the existing architecture poses for satisfying these security requirements? Well, the current service-to-service authorization policies are all basically just network-level ACLs. Since everything is running as one instance per VM, we can design things around an IP address or a DNS name as a way of identifying a service. But when services are running in containers on the same host, those constructs suddenly lose meaning.
And as a microservices deployment gets more and more complex, this does not really scale. Additionally, this perimeter-based security model is really not sufficient on its own to protect against unauthorized access to services in the network. Take for example this admin jump box. If anyone were able to compromise an operations team member's credentials and get SSH access to that VM in the admin jump box subnet, they'd basically have unfettered access to any of the microservices running in the VPC. So this is quite dangerous. So now the question is: where do we go from here? How do we actually achieve these security objectives? Well, I think it's pretty clear we need a new model. The existing network-based model is not really working for us the way we want it to. What we really need is a secure, precise way of identifying workloads, to enable strong authentication in service-to-service communication, in order to limit access to only authorized actors in the system. So this is kind of the mission statement; this is what we're trying to achieve to satisfy the security challenges that we've identified. So how will we actually go about this? Well, this is where SPIFFE comes in. SPIFFE stands for Secure Production Identity Framework For Everyone, and it's an open source set of specifications that define what a workload identity is, how you represent it, and how a workload obtains its identity. This concept of a workload identity in SPIFFE is not tied to any network-level constructs like IP address or DNS name, so it allows us to identify workloads in this more microservices-oriented, container-based world. So how does SPIFFE describe a workload identity? Well, through something called a SPIFFE ID. This is an identifier string, which is actually represented as a URI, and there's an example of such a SPIFFE ID on the slide. A SPIFFE ID has three components.
One is a static scheme, spiffe://, which indicates that this is a SPIFFE ID; it's just a static component. Then the host portion of the URI is something called a trust domain, and a trust domain is really a logical security and/or administrative boundary for trust. And finally there's the path component. This is really a user-defined value; it can be anything. The specification does not dictate what the value is, other than which characters can appear in the path. But basically this path is what represents the name of the workload, which other workloads can use to identify it. So in this case, we're saying this is the workload named by that path, running in the mydomain.org trust domain. Okay, great. So we have a way to refer to a workload, but that in itself is not really enough to tell us, for a given RPC, how we know that this service is who it claims to be. So how do we enable strong authentication, using the SPIFFE ID building block that we just talked about? Well, this is done through something called a SPIFFE Verifiable Identity Document, or SVID. SPIFFE defines the specification of an SVID, and an SVID is a digitally signed document which allows workloads to enable strong authentication using cryptographically verifiable chains of trust. There are two different kinds of SVIDs defined in SPIFFE today. One is the X.509-SVID: this is an X.509 certificate with certain constraints defined by SPIFFE. And then there is also a JWT-SVID type, a JSON Web Token; this SVID type uses the standard JWT format but with some additional constraints introduced. So here's an example of an X.509-SVID, and I've just highlighted the portions that are relevant today. We have an issuer; in this case, this certificate is issued by an organization called SPIFFE. And we also have a subject, and here the subject is basically an organization called SPIRE.
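The three components just described can be pulled apart with ordinary URI parsing. This is a minimal illustrative sketch using the Python standard library, not a full SPIFFE ID validator (the spec imposes stricter character and length rules than this), and the example ID is the one used later in the talk:

```python
from urllib.parse import urlparse

def parse_spiffe_id(spiffe_id: str) -> dict:
    """Split a SPIFFE ID into its components: a static 'spiffe' scheme,
    a trust domain (the URI host), and a user-defined workload path."""
    parts = urlparse(spiffe_id)
    if parts.scheme != "spiffe":
        raise ValueError("not a SPIFFE ID: scheme must be 'spiffe'")
    if not parts.netloc:
        raise ValueError("SPIFFE ID is missing a trust domain")
    return {"trust_domain": parts.netloc, "path": parts.path}

# The orders workload in the example.org trust domain:
print(parse_spiffe_id("spiffe://example.org/service/orders"))
```

Real implementations such as the go-spiffe library do this parsing (with full spec-level validation) for you.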
And then you'll see in the validity portion here there's a not-before and a not-after, and these are actually only an hour apart from one another. So this is a really powerful thing, to have a short-lived identity, and it's something we'll talk about a little bit later with SPIRE. But the core thing to highlight here is the URI subject alternative name in the certificate. This contains the SPIFFE ID of the workload, so this is really what identifies what this certificate is issued for. In this example, it's issued for the service orders in the example.org trust domain. And then here's an example of a JWT-SVID. Very similar: we have a subject claim here, the sub claim, which holds the SPIFFE ID of the service this token is issued for. We also have an aud claim, which represents the list of audiences this token is valid for, and in this case we're saying the token is valid for the account service. There are also two other claims: exp, which stands for expiration time, and iat, which stands for issued-at. And you'll see here that these values only differ by 300 Unix-epoch seconds, and 300 seconds is five minutes. So again, this highlights that this is a short-lived identity, which is really powerful because if the identity were compromised in any form, it would only be valid for up to five minutes in this example. Okay, so we've talked about how you refer to a workload by its SPIFFE ID, and how that identity is represented in a cryptographically verifiable document called an SVID. Now, how do workloads obtain their identity, or in other words their SVID? Well, this is defined in SPIFFE as the Workload API. This is a gRPC-based API, which in the recommended deployment form is available over a local Unix domain socket endpoint.
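The claim checks described above can be sketched in a few lines. This is an illustration of the audience and lifetime checks only; a real JWT-SVID validator must first verify the token's signature against the trust domain's signing keys, and the claim values here are shaped like the talk's example rather than taken from a real token:

```python
def validate_jwt_svid_claims(claims: dict, expected_audience: str, now: int) -> bool:
    """Check two of the claims discussed above: the caller's intended
    audience must appear in 'aud', and 'exp' must still be in the future."""
    if expected_audience not in claims.get("aud", []):
        return False
    if now >= claims.get("exp", 0):
        return False
    return True

# A five-minute token: exp - iat = 300 seconds, as in the slide.
claims = {
    "sub": "spiffe://example.org/service/orders",
    "aud": ["spiffe://example.org/service/accounts"],
    "iat": 1_600_000_000,
    "exp": 1_600_000_300,
}

# Valid 100 seconds after issuance, rejected once the 300 seconds elapse.
print(validate_jwt_svid_claims(claims, "spiffe://example.org/service/accounts", now=1_600_000_100))
print(validate_jwt_svid_claims(claims, "spiffe://example.org/service/accounts", now=1_600_000_400))
```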
It can also be exposed over a TCP socket, but in general that is not recommended, for some security reasons that I won't go too much into today; you can read about them in the spec if you're interested. This API exposes a few RPCs, but these are the two I'll hone in on today, which are probably the most important. One is the X.509-SVID RPC, called FetchX509SVID, and this is how a client actually gets its identity in the form of an X.509-SVID. It's a server-streaming RPC, meaning that when the client connects, it opens a long-lived stream over which the server can push things later; we'll see how that works shortly. And then there's a JWT-SVID RPC, which in this case is a unary RPC, so the client requests its identity and receives a single response, and any time it needs another identity, it just re-requests it. Okay, so that's SPIFFE. We've now defined these primitives of workload identity: how it's represented, how a workload obtains one, and how we use these things to actually enable secure authentication. Well, this is where SPIRE comes in. SPIRE is an open source implementation of the SPIFFE specifications. I just want to highlight a few key features of SPIRE that are really interesting, above and beyond what SPIFFE defines. One is that it automatically refreshes the identity documents that it issues. This is really powerful when you combine it with the concept of a short-lived identity, because you can define these short-lived identities in the platform, and SPIRE will automatically rotate them and push the new identity documents directly to the workloads that need them. Additionally, SPIRE can run in many environments, including Kubernetes, native VMs in AWS, GCP, and Azure, as well as your own private cloud infrastructure. It supports identification of workloads via several means.
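The difference between the two call shapes can be modeled with a toy mock. This is not the real gRPC Workload API, just a sketch of the streaming-versus-unary semantics: the streaming call is a generator that keeps yielding renewed SVIDs over one open "stream", while the unary call returns exactly one response per request:

```python
import itertools

def fetch_x509_svid_stream(spiffe_id: str):
    """Server-streaming sketch: the caller holds the stream open and
    receives each rotated certificate without re-requesting it."""
    for serial in itertools.count(1):
        yield {"spiffe_id": spiffe_id, "serial": serial}

def fetch_jwt_svid(spiffe_id: str, audience: str) -> dict:
    """Unary sketch: one request, one response; the workload calls
    again whenever it needs a fresh token."""
    return {"sub": spiffe_id, "aud": [audience]}

# Two pushes arrive on the same open stream; the unary call is one-shot.
stream = fetch_x509_svid_stream("spiffe://example.org/service/orders")
first, second = next(stream), next(stream)
token = fetch_jwt_svid("spiffe://example.org/service/orders",
                       "spiffe://example.org/service/accounts")
print(first["serial"], second["serial"], token["sub"])
```

This streaming shape is what later lets the agent push rotated certificates to workloads transparently.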
One is a Unix process, another is a Docker container or a Kubernetes pod. It also has the flexibility to be extended in several different ways: you can actually write your own plugins and plug those into SPIRE to adapt it to your own environment or use cases. So I want to quickly go over what the architecture of SPIRE looks like and how the control plane is designed. There are three components here. On the right we have the SPIRE database, and this contains information about the different types of workloads that can receive identity in this environment. In this case we're talking about one trust domain, which is the logical boundary for a SPIRE deployment. The SPIRE server deployment talks to the SPIRE database and manages those workload registrations, and it can be deployed as a highly available, active-active deployment. Then on the left we have two different hosts in this example, host A and host B, and each of these hosts runs a host-local agent called the SPIRE agent. The responsibility of the SPIRE agent is to reach out to the SPIRE server to get workload identities for all the different types of workloads which can run on the same host as that agent, to cache those identities for their lifetimes, to deliver them to the workloads that need them, and to refresh them over time. So this is, at a high level, the architecture. Now I want to walk through the whole flow of how a workload is registered in SPIRE, how a workload obtains the identity for that registration, and how that identity is rotated and pushed back to the workload. At the top here we have a component called a workload administrator, and this could be something like a deployment engine which launches containers or processes that need identity.
It could also be a separate service which uses a source of truth shared with the workload orchestrator, so it can look up information about deployments and see where workloads are running. Its purpose in this diagram is to make registrations in the SPIRE server, to designate the identities which should be available in this environment. So in this example, our workload orchestrator is going to deploy an instance of the order service and the account service. When it goes to do that deployment, it reaches out to the SPIRE server and registers identities for the orders and accounts services, and the SPIRE server persists those workload registrations in its database. Okay, so that's the registration onboarding flow. Now, on the left here we have two different hosts again, host A and host B, and each of those is running an instance of a SPIRE agent. Each SPIRE agent has a secure channel set up with the SPIRE server, and it requests all of the workload identities that it can serve on the host it's running on. So it sends certificate signing requests for all the X.509 identities it can serve, and the SPIRE server signs those certificate signing requests and sends them back to the agents that requested the signings. Okay, so now the agent has the identities for the workloads cached locally. Next, the orchestrator launches instances of the order service and the account service: the order service runs on host A, and the account service runs on host B. The order service requests an X.509-SVID from the SPIRE agent, and the account service does the same on its local host. Now both of those services have their identities from SPIRE, and they set up a secure channel over mutually authenticated TLS. And now the order service can issue an RPC over that secure channel.
And using the SPIFFE ID present in the X.509-SVID, the account service can verify whether or not the service making the request is authorized to issue that request. So it can basically define an authorization policy which says that only the order service is allowed to access the get-user-account API, for example. Okay, so great: we now have secure authentication between services, and it's cryptographically verifiable with TLS. The next part of the lifecycle of the X.509-SVID is that at some point it's going to expire, so the agent has to request new, rotated SVIDs for these workloads. It reaches out to the SPIRE server and asks for newly signed certificates, and the SPIRE server refreshes those X.509 identities the same way it originally issued them, by signing the certificate signing requests sent by the agent. Once the agents get those updated identities, they push them to the services which contacted them over the Workload API. So earlier we talked about how the FetchX509SVID RPC in the SPIFFE Workload API is a server-streaming RPC; this is why it's designed that way, so that later on the SPIRE agent can transparently push the rotated identities to the workloads. Okay, so let's focus on the first part of this overall flow: the workload orchestrator creating these workload registrations. This is really how SPIRE identifies the different workloads that can run on the different hosts in the environment, and this is how we represent a registration in SPIRE. There are three key properties I'm going to focus on. One is the SPIFFE ID, at the top here; this represents the workload itself, so this is what the identity is issued for, in this case the service orders. And then there's a concept called a parent ID.
And this parent ID is really powerful; it allows us to group workload registrations to a particular agent or set of agents, so it lets us effectively control which workloads can run on which hosts. And then the third thing here is something called a selector. A selector is a runtime attribute that we use to identify the workload that is trying to obtain an identity. So in this case we have a Docker environment variable selector called SERVICE_NAME, which is equal to orders; a pod-name selector for Kubernetes, which is set to the value orders-service; and a Kubernetes namespace selector, which is set to production. And so what this means is that any workload which is running as a Docker container with this SERVICE_NAME environment variable set to orders, and has a pod name of orders-service, and is in the Kubernetes namespace production, is entitled to this identity of service orders. Selectors are powerful and there are several you can choose from; other examples would be pod attributes in Kubernetes and labels of Docker containers. Okay, so let's talk a little bit about workload attestation. This is the process of identifying a workload and issuing it an identity. So at the top here we have a workload, which reaches the SPIRE agent over the Workload API to request its identity. This is an unauthenticated request: the workload doesn't provide any sort of certificate or token or key, it just says, hey, give me an identity. The SPIRE agent then uses its configured workload attestor plugins to determine what that workload is. It's going to dynamically discover runtime attributes of the process in order to match them against the selectors in the registrations it knows about. In this case, there are three plugins configured: a Unix plugin, a Docker plugin, and a Kubernetes plugin.
And each of these plugins reaches out to a trusted authority in our infrastructure. In the case of Unix, it's going to reach out to the Linux kernel; in the case of Docker, the Docker daemon on the host; and in the case of Kubernetes, the local kubelet running on the host. Okay, so the workload contacts the Workload API, the agent asks these different plugins for all the selectors describing that process, and it gets back a set of selectors: I know that this process is running with a UID of 12345, which maps to a user called bob; it has the SERVICE_NAME environment variable in Docker set to orders; it has a pod name in Kubernetes of orders-service; and it has a Kubernetes namespace of production. Well, if you're observant, you may have noticed that these last three selectors match the registration we talked about on the previous slide. Because it matches those three selectors, we can give this workload the identity for that registration of service orders that we talked about. So the Workload API in this example would return the X.509-SVID for the spiffe://example.org/service/orders identity. So that's how workload attestation works. Now, what have we seen so far? There have been a lot of concepts here, so what does it all mean? We've seen that the orders and accounts services are now bootstrapped with cryptographically verifiable identities, in this case X.509-SVIDs. This really enables the orders and accounts services to trust one another, because they can verify the chains of trust associated with those SVIDs. Using that verification, they can establish secure communication over mutually authenticated TLS to encrypt the traffic between the orders and accounts services.
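The matching step just described boils down to a superset check: a registration matches when every one of its selectors appears in the set the attestor plugins discovered. This is a simplified sketch of that logic, not SPIRE's actual code, and the "type:key:value" selector strings are a simplification of SPIRE's real selector syntax:

```python
def matches(registration_selectors: set, discovered_selectors: set) -> bool:
    """A registration matches when all of its selectors are among the
    selectors the attestor plugins discovered about the process."""
    return registration_selectors <= discovered_selectors

# The registration from the earlier slide, as simplified selector strings.
registration = {
    "spiffe_id": "spiffe://example.org/service/orders",
    "selectors": {
        "docker:env:SERVICE_NAME:orders",
        "k8s:pod-name:orders-service",
        "k8s:ns:production",
    },
}

# What the Unix, Docker, and Kubernetes plugins discovered at runtime.
discovered = {
    "unix:uid:12345",
    "unix:user:bob",
    "docker:env:SERVICE_NAME:orders",
    "k8s:pod-name:orders-service",
    "k8s:ns:production",
}

if matches(registration["selectors"], discovered):
    print("issue SVID for", registration["spiffe_id"])
```

Note that the extra discovered selectors (the UID and user) don't prevent the match; only the registration's own selectors must all be present.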
And using this SVID concept, the accounts service can now identify, in a very precise and secure way, what the workload on the other side of the channel is, using the SPIFFE ID present in the SVID. And based on that authentication, it can actually start to write authorization policies that say, for example, only service orders can contact this accounts service on this particular API. And we've seen that these cryptographic keys, which are carried in SVIDs, are rotated automatically by SPIRE and pushed transparently to the services on update, so the services don't need to manage any logic to refresh their identities on a certain cadence. So what's coming next? I just want to quickly talk about upcoming things in SPIRE which might be interesting to you. A major thing we're focusing on is trying to improve the user experience in Kubernetes to be much more turnkey; right now it's a little bit complicated, and there are a lot of configurations you need to think about, so we're trying to make it easier to get going. We're also trying to support workloads in environments where you cannot run a SPIRE agent. Environments you might think of in this category are serverless computing platforms, like AWS Lambda for example, where we don't really control the infrastructure the workload runs on, but the workload may still need an identity to talk to services. So we're trying to support those use cases. We're also working on providing a privileged API which delegates the responsibility of issuing SVIDs to workloads to a trusted actor. An example might be a proxy on a host, where services route all their traffic through a local proxy, and the proxy determines the identity of the caller and how to propagate that identity, in a TLS connection for example.
Then we are also working on improving the support for federation. Federation is a concept in SPIFFE where we may have a workload running in one trust domain and a workload running in another trust domain, and they need to be able to talk to and trust each other. So we're working on providing a configurable API to designate the federation relationships for different workloads in the system. We're also working on improving our attestation of supply chain provenance using binary signing and verification. We're also trying to enable secretless authentication to GCP using OpenID Connect (OIDC) federation, and similarly for Azure we're trying to enable secretless authentication using OIDC federation. And in the longer term, we are working on how to improve key revocation and forced rotation within SPIRE. We're also doing some production-readiness work, like improving the health check subsystem to be more robust and improving error messages so that they're more actionable for users. So if you're interested in SPIFFE and SPIRE, I'd encourage you to attend the talk coming up next, called "Bridging the Great Divide: SPIFFE/SPIRE for Cross-Cluster Authentication," which is going to talk a lot more about federation and how you can leverage those concepts. If you're interested in learning more about the project, I'd encourage you to go to spiffe.io, which is the website for SPIFFE and SPIRE. We also have GitHub repos for both SPIFFE and SPIRE, and we have a Slack workspace where we engage with users on a regular basis. That's all I have, so thank you for attending.