Welcome to KubeCon North America 2020, virtual edition. This is a maintainer track session with me, Andres Vega. I work at VMware as a product line manager for Tanzu foundation services. I also happen to be a contributor to the SPIFFE and SPIRE projects, where I work alongside the maintainers, helping out with aspects of project, product, and program management. I'll be talking to you today about the term we've recently started referring to SPIFFE and SPIRE by, which is the production identity control plane, and along with that, recommended practices for SPIFFE and SPIRE at scale.

To start out, I'd like to thank the group of people from the community that recently got together as part of the SPIFFE and SPIRE book sprint. These are leading practitioners and subject matter experts who, over the course of two weeks, wrote a 200-page book on deploying and operating SPIFFE and SPIRE at scale. Among them: Emily Fox, Evan Gilman from VMware, the crew at HPE (Daniel Feldman, Umair Khan, Max Lambrecht, Agustín Martínez Fayó), Ian Haken from Netflix, Frederick Kautz from Network Service Mesh, Brandon Lum from IBM, Eli Nesterov from ByteDance, and Michael Wardrop.

In terms of the agenda, we're going to start off with the motivation for production identity, do a quick recap of the fundamentals and concepts of SPIFFE and SPIRE, and move into deployment primitives for SPIFFE and SPIRE at scale. Once we've gone over that, we'll move into day-one aspects of managing registration entries and managing identities; how you enable your software applications to be SPIFFE-ready, be it natively or through a proxy; and how you actually stage the rollout of SPIFFE and SPIRE throughout your organization.

To set the stage, the problem that SPIFFE and SPIRE solve is that of secure introduction. If you look at the landscape of modern applications, these are composed of a large number of smaller pieces of software. Managing secrets at scale requires effective access control.
Implementing that access control requires a strong bedrock of identity, and proving identity has traditionally required possession of a secret. Now, the challenge is that protecting one secret requires coming up with some other way to protect that secret, which then requires protecting that other secret, and so on. Ian Haken from Netflix likes to allude to infinite regression as an analogy for the problem, or solving for the bottom turtle. If you think of turning on access control to a resource, be it a database or a service, that will require a secret such as an API key or password. Now, that API key or password needs to be protected. You can protect it with encryption, but then you still need to worry about the decryption key, and how do you secure that? You could put it into a secret store, but then you still need some other form of credential, like a password or API key, to access that secret store. Ultimately, however you end up protecting access to that secret results in yet another secret. Secret stores are great, but they do present a challenge: at some point you have to prove who you are to enter the secret store and pull something from there.

There are a number of other PKI and authentication pain points in modern applications. You need to ask yourself: how are certificates and passwords generated, and who is responsible for that? How are those securely distributed to the applications that need them? How is access to private keys and passwords restricted? How are these secrets stored such that they don't leak into backups? What happens when a certificate expires or a password must be changed? Is that rotation process disruptive? How many of these tasks necessarily involve a human operator today?

So with SPIFFE and SPIRE, we've proposed, with the SPIFFE specification, a standard set of interfaces.
By interfaces, I mean APIs and documents for proving, validating, and obtaining a service or workload identity. I will be using service and workload interchangeably. Now, SPIRE is the software implementation of these specifications: a tool chain for establishing trust between software systems.

There are a number of reasons why you would want to use SPIFFE and SPIRE. SPIFFE can be used as the basis for a product you're developing that requires transport layer security: TLS features, along with user management and authentication, can be dropped in through SPIFFE in one fell swoop. You can replace the need for managing and issuing API tokens for platform access, bringing rotation in for free and eliminating the burden of storing and managing access to those tokens. You can deliver mTLS across untrusted environments without the need to exchange secrets. Security and administrative boundaries can be easily delineated, and communication can occur across those boundaries when and where policy allows. To reduce the likelihood of a breach through credential compromise, SPIRE provides a strongly attested identity for authentication across the entire infrastructure. And SPIFFE and SPIRE address security needs by enabling pervasive mTLS to secure communications between workloads no matter where they are deployed, anywhere in a heterogeneous environment.

So let's take a closer look at the SPIFFE spec. SPIFFE in a turtle shell, since we're talking turtles all the way down and reaching for that bottom turtle. The SPIFFE ID is the representation of a service name; it's the identity the software is going to be issued. That issuance is done through an SVID, a SPIFFE Verifiable Identity Document, which can come in the flavor of either a JSON Web Token or an X.509 certificate. Then there's the SPIFFE Workload API, which is the interface through which services obtain their identities without any prior secret being required.
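To make the structure concrete: a SPIFFE ID is a URI of the form spiffe://&lt;trust-domain&gt;/&lt;workload-path&gt;. Here is a minimal shell sketch (the example.org trust domain and the billing path are made up for illustration) that pulls one apart with plain parameter expansion:

```shell
# A hypothetical SPIFFE ID: scheme, trust domain, then workload path.
id="spiffe://example.org/billing/payments"

# Strip the scheme, then split into trust domain and workload path.
no_scheme="${id#spiffe://}"
trust_domain="${no_scheme%%/*}"      # everything before the first slash
workload_path="/${no_scheme#*/}"     # everything after it, slash restored

echo "trust domain:  $trust_domain"
echo "workload path: $workload_path"
```

The trust domain names the issuing authority; the path identifies the workload within it, which is the split the talk returns to when discussing trust domains.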
There's the trust bundle, which is a collection of public keys, given by an issuing authority, that a service uses to authenticate others. And then there's SPIFFE Federation, which is a simple mechanism by which trust bundles can be shared.

One additional construct that is useful to know when we reason about SPIRE deployments is that of a trust domain. A trust domain corresponds to the top-level root of trust of a system. This can be modeled after an individual developer working in a Kubernetes namespace carved out for themselves. It could be an entire team or a larger organization. It can be an environment such as dev, stage, or prod. It can be an entire business unit running its own independent infrastructure. Now, all workloads in the same trust domain will be issued identity documents that can be verified against the root keys of that trust domain. The workload identifier portion of the SPIFFE ID identifies a particular workload within a particular trust domain.

So that's it for SPIFFE. Let's look at some of the SPIRE components once this is actually running. We've got the SPIRE server, and the SPIRE server is responsible for managing all the identities in the trust domain. Think of it as a global identity directory. The server also happens to be in possession of the SVID signing keys, and it is considered a critical security component; special consideration should be paid when deciding where the server goes. The server uses a data store to keep track of its current registration entries as well as the status of the SVIDs it has issued. There are a number of backing databases that are supported; you should refer to the official documentation for what those are. Now, all the SVIDs in a trust domain are signed by the SPIRE server, and these are signed by default using a self-signed certificate, unless an upstream certificate authority exists and has been configured through a plugin interface. In many cases, a self-signed, randomly generated key is fine.
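As a rough sketch of how these pieces surface in configuration, a minimal server.conf might look like the following. The paths, the example.org trust domain, and the SQLite data store are illustrative assumptions; consult the official SPIRE documentation for the authoritative set of options and supported databases:

```hcl
server {
    bind_address = "0.0.0.0"
    bind_port    = "8081"
    trust_domain = "example.org"
    data_dir     = "/opt/spire/data/server"
    log_level    = "INFO"
}

plugins {
    # The data store keeps registration entries and issued-SVID state.
    DataStore "sql" {
        plugin_data {
            database_type     = "sqlite3"
            connection_string = "/opt/spire/data/server/datastore.sqlite3"
        }
    }

    # In-memory signing keys; absent an UpstreamAuthority plugin,
    # the server acts as a self-signed CA.
    KeyManager "memory" {
        plugin_data {}
    }
}
```

Swapping the data store for a shared database is what later allows multiple servers in the same trust domain to be horizontally scaled.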
However, for larger installations and production environments, it might be desirable to take advantage of pre-existing CAs and the hierarchical nature of X.509 to chain multiple SPIRE servers together.

Then we've got the SPIRE agent. The SPIRE agent serves a very simple function, but a very important one, and that is to expose the Workload API to workloads. There is no active management required to operationalize agents. They require a config file, and that config file tells the agent what trust domain it's part of and what workloads can call it directly to get identities.

The entire architecture of the system, for both server and agent, is based on plugins. There is the upstream authority plugin I mentioned; there are node attestor plugins, node resolvers, data store plugins, and key managers, whether on disk or in memory. The agent has the workload attestor plugin. So these are perfectly extensible. A deep dive into the plugins is outside the scope of this presentation; I do encourage you to check the official documentation on spiffe.io to learn more about those.

Now, node attestation is the process of establishing trust between the server and the machine an agent runs on. In a cloud environment, it is considered a best practice to verify that node against metadata available from the cloud provider. SPIRE does provide node attestors designed specifically for the cloud that you may run on. Most cloud providers have an API that can be used to identify the API caller. Node attestor and resolver plugins are available for AWS, Azure, and Google Cloud. The purpose the attestor plugin serves is to attest the node before issuing an identity to the SPIRE agent running on that node. Once it has verified the integrity and authenticity of the machine, the SPIRE server can use an installed resolver plugin, which allows selectors to be created that match against the node metadata. The available metadata is cloud-specific.
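To illustrate the cloud attestation flow, here is roughly what attesting agents on AWS via instance identity documents looks like in the respective plugin sections. The aws_iid attestor is a real SPIRE plugin; treat the exact fields as a sketch and check the plugin documentation for your version:

```hcl
# server.conf: verify nodes against AWS instance identity metadata.
plugins {
    NodeAttestor "aws_iid" {
        plugin_data {}
    }
}

# agent.conf: present this node's instance identity document to the
# server during attestation.
plugins {
    NodeAttestor "aws_iid" {
        plugin_data {}
    }
}
```

Once a node is attested this way, selectors derived from the instance metadata can be used to scope which identities that node is allowed to serve.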
If an environment does not provide a hardware root of trust or an ability to attest the node, it is possible to bootstrap using a join token. However, doing so provides a very limited set of assurances, depending on the process through which that is done.

As a second stage, we have workload attestation. Workload attestation is the process of determining the workload identity that will result in an identity document being issued and delivered. That occurs whenever a workload calls and establishes a connection to the SPIFFE Workload API, and from there on it is driven by a set of plugins on the SPIRE agent. Each attestor plugin is responsible for introspecting the caller and generating a set of selectors for it. This happens out of band. One plugin will introspect kernel details and look at information such as the user and group the process is running as, while a separate plugin will communicate with Kubernetes and generate selectors such as the namespace and service account the process is running in, and a third plugin may communicate with Docker to get Docker labels, the Docker image ID, and container environment variables.

Now that we've done that recap, let's look at some of the security boundaries of a SPIRE deployment. The first boundary is that of the workload to the agent. The agent does not trust the workload to give any kind of input; the attestation is performed through out-of-band checks, as we covered in the previous slide. Any selector whose value could be manipulated by the workload is itself inherently insecure. It is expected that a security mechanism beyond SPIRE exists to provide isolation, be it Linux user permissions or containerization.

Then we have the boundary between agent and server. An explicit design goal of SPIRE is that it will survive node compromise. Agents have the ability to create and manage identities on the workload's behalf, but at the same time the power an agent has to do so is limited strictly to what it needs to complete that task.
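Where no platform attestation is available, the join-token flow mentioned above can be sketched as follows. The SPIFFE ID, config path, and token placeholder are hypothetical; the commands are shown for illustration against a running deployment, not executed here:

```shell
# On the server: mint a single-use join token, optionally pre-assigning
# a SPIFFE ID to the node the agent will run on.
spire-server token generate -spiffeID spiffe://example.org/my-node

# On the node: start the agent with that token; the server verifies it
# and issues the agent its node identity.
spire-agent run -config /opt/spire/conf/agent/agent.conf \
    -joinToken <token-from-previous-step>
```

Because possession of the token is the only proof involved, the security of this flow is only as good as the process used to deliver the token to the node.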
And in order to mitigate the impact of an eventual compromise, SPIRE requires knowledge of where a particular workload is authorized to run. A node cannot just get information for any workload unless that workload has been designated to run there. Agents must be able to prove ownership of a registration entry before they can obtain an identity for it, and as a result, they cannot get arbitrary identities for anything else in the trust domain unless it's supposed to run there. It is important to note that communication between the SPIRE server and the SPIRE agent can use TLS and mutual TLS at different points in time during the attestation process. Once trust has been established, TLS takes over and all communications are secure.

The last boundary comes up in environments that have been horizontally scaled, where you have more than one server: two, three, four SPIRE servers to manage load. SPIRE servers are trusted only to issue SVIDs within the trust domain they directly manage. When SPIRE servers federate with each other and exchange public key information, the keys they receive remain scoped to the trust domain they were received from. Unlike the web PKI, SPIFFE does not simply throw all the public keys into one big mixed bag. The result is that a compromise in a foreign trust domain will not result in the ability to issue SVIDs in the local trust domain. It should be noted that SPIRE servers do not have multi-party protection: every SPIRE server in a trust domain has access to signing keys with which it can issue SVIDs. The security boundaries that exist between servers are limited to servers in different trust domains and do not apply to servers within the same trust domain.

Having covered that, let's start to look at some of the topologies and building blocks with which we can build large-scale deployments.
Eventually, we want to get to something like this picture, where there are objectives for compliance and regulation, where you have multiple teams, multiple platforms, and varying requirements. So let's start off with the most basic deployment, which is a single trust domain. You may want to do this where there are no administrative boundaries, where you want one big trust domain, and it will reduce the total number of distinct deployments to manage. On the right-hand side, you can take a look at how this would look in Kubernetes: you can have multiple Kubernetes clusters all managed by the same SPIRE server. As mentioned previously, servers can be horizontally scaled, be it for redundancy or for distributing load. All servers in a given trust domain will read and write to the same shared data store.

Now, as deployments grow in size and number, and there are multiple administrative boundaries and perhaps multiple cloud provider environments, you want to look at a nested SPIRE deployment. A nested topology allows communication between the agents and the server to occur within their own data center or their own region. In this particular configuration, the top-tier SPIRE servers hold the root certificates and keys, and the downstream servers, the ones you see for the two regions, request intermediate signing certificates. If the top tier happens to go down, the intermediate servers continue to operate, so there's quite a bit of resiliency here. Again, this is well suited for multi-cloud, multi-region deployments where replicating a data store across regions would be cumbersome. Due to the ability to mix and match node attestors for different platforms, a downstream server can very well reside in, and provide identities for, an environment whose workloads have different selection and identification criteria when it comes to attestation.

Another primitive is that of federation.
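In a nested topology, the downstream server obtains its intermediate signing certificate from the upstream tier. A sketch of the downstream server's configuration is below; the address and socket path are hypothetical, and while the UpstreamAuthority "spire" plugin is the mechanism SPIRE provides for this, the exact fields should be checked against the docs for your version:

```hcl
plugins {
    # Chain this server's CA under an upstream SPIRE server.
    UpstreamAuthority "spire" {
        plugin_data {
            server_address      = "upstream-spire.example.org"
            server_port         = "8081"
            # The downstream server authenticates to the upstream tier
            # through a local agent's Workload API socket.
            workload_api_socket = "/run/spire/sockets/agent.sock"
        }
    }
}
```

This is what lets the regional servers keep issuing SVIDs from their intermediates even while the root tier is unreachable.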
We talked about federation in the SPIFFE spec a little bit, but federation allows you to establish trust between multiple SPIRE deployments, or from SPIRE to any other SPIFFE-compatible CA. When deployments have multiple roots of trust, say two different companies trying to establish a business relationship that don't want to create a VPN connection from one to the other, you may leverage something like federation. Now, that is one of many use cases; interoperability between a SPIRE deployment and a service mesh that uses SPIFFE may be another motivation to do so.

Now, when it comes to putting this into applications, there are two ways to go about it. The first is enabling the software natively. It will require making modifications, and it can be introduced through a common library or framework used across application services. The native integration is the best approach for data plane services that are sensitive to latency, or services that want to utilize identity directly at the application layer. SPIFFE does provide libraries for the Go programming language and the Java programming language to facilitate development of SPIFFE-aware or SPIFFE-enabled workloads. You can find the links to the respective language libraries here on the slide.

Now, in places where the native approach may not be feasible, be it because the cost is high or the service is running third-party code that cannot be modified, an approach is to front the application with a proxy that supports SPIFFE. That could be something like the Envoy proxy, or it could be the SPIFFE Helper, which will watch the Workload API and reconfigure the workload as certificates change. Certainly, Envoy through the secret discovery service (SDS) is a popular approach for enabling SPIFFE for an application without having to make any code changes to it. It's also popular when you have polyglot environments.

Now let's look at managing registration entries.
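Before moving on, here is a rough sketch of what the federation primitive from a moment ago looks like in a server's configuration. The domain names, addresses, and ports are hypothetical, and the block shape reflects the federation config of SPIRE around this time; verify the exact syntax against the documentation for your release:

```hcl
server {
    federation {
        # Serve this trust domain's bundle so foreign domains can fetch it.
        bundle_endpoint {
            address = "0.0.0.0"
            port    = 8443
        }
        # Periodically fetch the foreign trust domain's bundle.
        federates_with "otherdomain.test" {
            bundle_endpoint {
                address = "spire.otherdomain.test"
                port    = 8443
            }
        }
    }
}
```

Individual registration entries that should trust the foreign domain are then created with a flag such as -federatesWith so the foreign bundle is delivered alongside the workload's SVID.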
So the SPIRE server supports two different ways to add registration entries: via the command-line interface or via the registration API. The API allows admin-only access. Once we have a deployment in place, there are no registrations in there, so a bootstrap process is needed. That is often done by an operator or a deployment process. In practice, that doesn't really scale for large or high-growth deployments; you want to remove manual processes from the picture. You also want to make sure that if workloads are dynamically scheduled or elastically scaled, you can keep up with those changes. So you can see here in the picture how an automated process, following whatever schema you came up with for modeling the identities, gets handled by the workload orchestrator, which automatically creates the registration entries on the server as workloads come and go.

I'd like to credit Michael Wardrop for the mental model of reasoning about rolling out a deployment in terms of independent islands and bridged islands. The independent islands model allows individual trust domains to operate independently of one another. This is often the easiest option when you have multiple apps: you start with whatever you can target to turn SPIFFE on, and evolve from there. However, these islands may not necessarily have knowledge of each other; there may not be any communication between them. In another model, independent SPIFFE deployments get bridged by federation, enabling services from each island to trust each other, but there's still no communication between SPIFFE and non-SPIFFE islands. As this continues to grow, the bridged islands model can also use a gateway to talk from a non-SPIFFE service to a SPIFFE service.
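To make registration concrete, here is a sketch of creating entries by hand with the CLI; an automated registrar would issue the equivalent registration API calls. The SPIFFE IDs, cluster name, namespace, and service account are hypothetical, and the commands assume a running SPIRE server:

```shell
# Register a node alias: agents attested into this Kubernetes cluster
# may act as the parent ID below.
spire-server entry create \
    -node \
    -spiffeID spiffe://example.org/k8s-cluster \
    -selector k8s_psat:cluster:demo-cluster

# Register a workload: processes in namespace "payments" running under
# service account "api" on those nodes receive this identity.
spire-server entry create \
    -parentID spiffe://example.org/k8s-cluster \
    -spiffeID spiffe://example.org/payments/api \
    -selector k8s:ns:payments \
    -selector k8s:sa:api
```

The parent ID is what ties the workload entry to the set of attested nodes allowed to serve it, which is the ownership proof the security-boundary discussion earlier relies on.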
A deep dive on this subject is outside the scope of this presentation, but it's certainly something to check out in the book, Solving the Bottom Turtle, where we go at length into how to reason about these different models and formulate a strategy around them.

There are many other considerations for scale; I certainly haven't covered them all. Apologies for the typo in the first bullet point in that section. There is load testing and capacity planning to be performed. You may want to weigh the security-versus-availability trade-off when determining the time-to-live of certificates, and how those, if at all, get revoked or renewed. It's important to keep in mind that short TTLs reduce the window in which an exfiltrated credential is useful to an attacker, but they do require a quicker response to an outage. You may want to reason about failure domains and blast radius, and pick one to optimize over the other. You may want to consider using HSMs, and to look at logging and monitoring.

That said, a great place to go for additional information is spiffe.io; check the docs. If you're ready to get started on the project, have any questions, or are eager to contribute, you're more than welcome to join the group on Slack and check out the code on GitHub. There is a very active and vibrant community around the project, and we're always looking for newcomers and new participation.

Awesome. With that said, I'll wrap it up. I think I'm right about on time. There were a few glitches during the presentation, but despite my rinky-dink setup I think I managed to do all right. Hopefully you found the talk informative, and you enjoy the rest of the event. See you all online. Thank you.