Okay, all right. So I'm Andrew Harding. Like Amir mentioned, I've been a SPIRE maintainer for a couple of years now, and recently I was made a SPIFFE maintainer; I've participated in that effort for quite some time. Today we're going to have a little bit of a deep dive on SPIFFE and SPIRE. Both Evan and myself are staff engineers at VMware. I'll let Evan introduce himself, though. Thank you, Andrew. Yes, I'm Evan. I spoke a little bit earlier during the SPIFFE update. The goal for this talk is really this: we've got a full day ahead of us, and we're really only just getting started, and there's likely to be a lot of jargon thrown around, a lot of terms that are specific to SPIFFE and SPIRE. So the goal of this talk is to familiarize everyone with all of that and do as deep a dive as is necessary to set folks up to really understand and follow the presentations for the rest of the day. We also recognize that this might not be new to some of you. We've tried our best to make the presentation as interesting as possible for those who may already know some of this stuff. And we do have a lot to share with you, so I hope we don't go too far over time here. But in the interest of time, I'm going to let that be that and pass it back to Andrew to walk through the agenda and kick things off. Okay. So, yeah, like Evan said, we're just going to do a quick overview to help people understand the rest of the talks coming today. We're going to start with a quick SPIFFE refresher, then go over the SPIFFE and SPIRE project goals, and then dive right into how SPIRE specifically tries to attain those goals: through the agent, through node and workload attestation, and through how SPIRE manages its keys and rotation strategies.
And then we'll talk a little bit about deployment and avoiding failure modes. So right off the bat, starting with SPIFFE. There we go. Kelsey talked about some of these topics, so hopefully this will be a very light refresher. We're going to start off with the SPIFFE ID. Again, this is the heart of SPIFFE and forms the way that we structure identity for services. It's a URI, and it's got a couple of components. The authority component represents the trust domain for the identity, and the path component represents the particular entity's identity within that trust domain. The trust domain, in SPIFFE nomenclature, is essentially a namespace and provides a boundary. Trust domain boundaries can be drawn along security boundaries. This could be different environments, like production versus staging, or other workloads or systems that you have some requirement around security isolation for. It could also be as simple as an administrative boundary, like a couple of different teams who want to manage their own independent SPIFFE implementations or deployments. So this could be billing versus sales versus human resources. The idea is that trust domains have signing authorities within them, and those signing authorities are responsible for issuing the secure identities for entities within that trust domain. The secure identity in SPIFFE is codified in what's known as the SPIFFE Verifiable Identity Document, or SVID. This document contains a SPIFFE ID and, again, is signed by an authority within the trust domain. We've got specifications out that define this type of document for both X.509 certificates and JWT tokens. So we've got our ID, and we've got it embedded into a signed document. Now let's talk about how we verify those documents. We do that with materials found in what's called the SPIFFE bundle.
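To make the two components concrete, here is a minimal sketch of pulling the trust domain and path out of a SPIFFE ID with Python's standard URI parser. The IDs shown are made-up examples, not real deployments.

```python
from urllib.parse import urlparse

def parse_spiffe_id(spiffe_id: str):
    """Split a SPIFFE ID into its trust domain (authority) and path components."""
    uri = urlparse(spiffe_id)
    if uri.scheme != "spiffe":
        raise ValueError(f"not a SPIFFE ID: {spiffe_id}")
    return uri.netloc, uri.path  # (trust domain, entity path within it)

# Hypothetical ID for a trust domain drawn along a team boundary.
td, path = parse_spiffe_id("spiffe://billing.example.com/payments/api")
print(td, path)  # billing.example.com /payments/api
```

The same trust domain (`billing.example.com`) can hold many distinct paths, which is exactly the namespace-plus-boundary structure described above.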
So again, this is a collection of public key material from the authorities for a trust domain, and it's used to validate SVIDs that belong to that trust domain. If you're reading through the documentation or the specifications, you'll also see this called the trust domain bundle or just the trust bundle. These three terms are used interchangeably. So building up: now we've got our ID, we've got a signed document over that ID, and we can verify it with bundles. Let's talk about how workloads receive this cryptographic material. That is done through the SPIFFE Workload API. The SPIFFE Workload API is something that unauthenticated workloads talk to; again, it provides SVIDs and bundle materials, and streams new materials to the workload as those materials change. And because it's an unauthenticated API — in other words, workloads don't have to bring some sort of identity or secret with them in order to authenticate against this API — it solves the secret zero problem for the workloads. The last thing we'll talk about in relation to SPIFFE is another mechanism to retrieve the bundles for a trust domain, and that is the Federation API. We've talked about this a few times already today. Again, this is just a very quick way for trust domains to exchange public key material so they can authenticate each other's SVIDs. It's a one-way relationship. When you contact this API, you're asserting trust in the bundle material that you pull off that API, but it doesn't go the other way around. So you can authenticate their identities, but the party you're federating with can't authenticate yours unless they also perform this one-way federation step to obtain your public key materials. So in a nutshell, SPIFFE gives us cryptographically verifiable, secret-zero-solving, frequently rotated, federatable, namespaced, uniform identity. And that's quite a list. It's huge.
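The one-way nature of federation can be sketched with a toy model: each trust domain keeps a map from trust domain name to bundle, and can only validate SVIDs from domains whose bundles it has fetched. Key material is stood in for by plain strings, and the domain names are invented for illustration.

```python
# Toy model of one-way federation. A trust domain can only validate SVIDs
# from trust domains whose bundles it has fetched; fetching another domain's
# bundle does NOT give that domain the ability to validate ours.

class TrustDomain:
    def __init__(self, name, authority_key):
        self.name = name
        self.authority_key = authority_key      # our own signing authority
        self.bundles = {name: {authority_key}}  # we always trust ourselves

    def federate_with(self, other):
        """One-way step: pull the other domain's bundle; they don't get ours."""
        self.bundles[other.name] = set(other.bundles[other.name])

    def can_validate(self, svid_trust_domain, signing_key):
        return signing_key in self.bundles.get(svid_trust_domain, set())

foo = TrustDomain("foo.test", "foo-key-1")
bar = TrustDomain("bar.test", "bar-key-1")
foo.federate_with(bar)
print(foo.can_validate("bar.test", "bar-key-1"))  # True: foo fetched bar's bundle
print(bar.can_validate("foo.test", "foo-key-1"))  # False: bar never fetched foo's
```

For mutual authentication, `bar` would have to perform the same federation step in the other direction, exactly as described above.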
And you might be asking yourself, well, maybe my infrastructure runs inside a very homogeneous environment where I already have some of these checkboxes checked off — all the ones that are important to me. So what's SPIFFE really bringing to the table? For SPIFFE and its project goals, it's not about services that are running in a single cloud environment, or this other cloud environment or that other cloud environment, or this organization, that organization, or yet another organization, or services running on bare metal or inside virtual machines or inside containerized environments. It's really about an identity substrate that provides all of those benefits and can span all of these different environments. So that's it for the SPIFFE recap. Let's talk about SPIRE now. SPIRE's whole goal in life as a SPIFFE implementation — the first SPIFFE implementation — is to light up that Workload API we talked about, the one that gives you those bundle and SVID materials, in as many different environments as possible, and to provide a sense of uniformity around management of those identities. And it does this starting with the SPIRE agent. This is the natural place to start for SPIRE's support of SPIFFE, because the agent is what lives alongside workloads, implements that Workload API, and provides those cryptographic materials to workloads through that API. The SPIRE agent itself doesn't start out with any of those materials. Those materials are centrally managed and signed by the SPIRE server. The SPIRE server acts as the centralized signing authority inside of your trust domain. And so there's a mechanism through which the SPIRE agent is able to reach out to the SPIRE server and obtain the materials that it will later feed down to workloads through the Workload API, and it caches those materials in an internal cache.
And so as bundles are prepared and updated inside of your trust domain by the SPIRE server, and as SVIDs are minted, those are sent down to the SPIRE agent and cached. And of course, as those materials change and rotate, the SPIRE agent is able to reach out and continue updating those materials inside of its cache, again making them available to workloads downstream via the Workload API. Now, these are cryptographically signed identities and their associated private keys. These are security-sensitive materials, and the SPIRE server isn't just going to hand them out to anybody. So there's a process through which the SPIRE agent is able to bootstrap and authenticate against the SPIRE server, and I'll kick it over to Evan to talk about that next. Thank you, Andrew. Yeah, so Andrew mentioned that SPIRE helps to solve the secret zero problem, which is: how do you get the first secret? How do you get the first credential? You probably don't want to bake it into your workloads or bake it into your nodes and deploy them. So how do you solve this, ideally at runtime? SPIRE and SPIFFE both look to solve that problem for workloads. SPIRE in particular, which uses this agent — the agent also has to have a solution to the secret zero problem. So when a new agent or a new node comes up online, how do you know the identity of that node or agent, in order to, in turn, authorize the issuance of SVIDs to it? We have a process for this called node attestation. This is a way that the SPIRE server can find out the identity of a new agent or a new node without that agent or node having to have any kind of preexisting secret baked into it. You can see on the right, there's one node attestor plugin for the agent. It is platform specific, generally.
And on the SPIRE server side, it has multiple node attestor plugins, so a SPIRE server can manage agents across different types of infrastructure. In this example, we have AWS and Google Cloud, as well as a bare metal TPM-based attestor. So I'm going to walk you through a couple of examples of how this actually works. In the AWS case, the agent invokes this node attestor, and this node attestor knows how to reach out and talk to AWS. In this case, we reach out to the AWS metadata service and grab a document that is signed by AWS and that AWS makes available only to this node. That document has the instance ID and other identity-related information for this particular node. So the agent plugin reaches out, grabs this thing, and then shoots it over to the SPIRE server over a TLS-protected connection. The SPIRE server receives this document and passes it down to its node attestor that pairs with the agent one. This node attestor knows not only how to validate the document that it got from AWS, but also how to call AWS APIs and perform an extra set of validations. Is this a new node? There are some anti-tampering checks that occur there. You could write whatever logic you want in there, really — that's the beauty of this pluggable system. But once the SPIRE server has effectively been convinced that the agent is running on instance ID 1234, for example, the SPIRE server will issue the agent its own SVID. This SVID identifies the agent uniquely within the trust domain, and the identity we issue the agent is derived from the attestation. So in this case, we issue an identity that's a function of this AWS account number and this AWS instance ID. And to demonstrate the flexibility of this node attestation mechanism, I have one more example, which uses a TPM. TPM stands for Trusted Platform Module. It's a small little chip that is soldered onto most motherboards these days.
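The AWS flow above can be sketched end to end in miniature. In reality the instance identity document is signed by AWS and verified against AWS's published certificates; here an HMAC with a shared key stands in for that signature, the account and instance values are mocked, and the derived SPIFFE ID path is only illustrative of "identity as a function of the attested account and instance".

```python
import hashlib
import hmac
import json

# Stand-in for AWS's signing key. In reality AWS signs the document and the
# server-side attestor verifies it against AWS's certificates.
AWS_SIGNING_KEY = b"not-a-real-aws-key"

def fetch_instance_identity_document():
    """Agent side: what the node attestor grabs from the metadata service (mocked)."""
    doc = json.dumps({"accountId": "123456789012", "instanceId": "i-1234"})
    sig = hmac.new(AWS_SIGNING_KEY, doc.encode(), hashlib.sha256).hexdigest()
    return doc, sig

def server_attest(doc, sig):
    """Server side: verify the document's signature, then derive the agent's identity."""
    expected = hmac.new(AWS_SIGNING_KEY, doc.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("document not signed by AWS")
    fields = json.loads(doc)
    # The agent's SPIFFE ID is derived from the attested account and instance
    # (illustrative path shape, not SPIRE's exact format).
    return (f"spiffe://example.org/spire/agent/aws_iid/"
            f"{fields['accountId']}/{fields['instanceId']}")

doc, sig = fetch_instance_identity_document()
print(server_attest(doc, sig))
```

A tampered document fails the signature check, which is the whole point: only a document AWS actually signed convinces the server of the node's identity.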
And it provides an enclave — if you've heard that term before, a secure enclave — for holding private keys and making other kinds of security assertions about the state of the hardware. So there's this TPM-based node attestor that knows how to reach out to this TPM. What it can do is grab a certificate that is burned into the TPM by its manufacturer. It sends the certificate over to the SPIRE server, where it is of course passed to the TPM-based node attestor on the SPIRE server. That node attestor is configured with the manufacturer CA certificate, so it knows how to validate the certificate: yes, this is a valid certificate from the TPM manufacturer we expect. Inside that certificate, and in a message sent along with it, are some public keys. And the private keys that are paired with those public keys are actually burned into this TPM hardware. So the server is then in a position to issue a challenge. It can take a nonce — a small, randomly generated secret — encrypt it with this public key that it received, which it now knows to be burned into this TPM by way of the certificate it received, and send it back to the agent. The agent receives this challenge and passes it down into its node attestor plugin, which then passes it into the TPM to be solved. The private key that the TPM is holding on to is able to solve this challenge and send back, in clear text, the bit of information that the SPIRE server generated. At this point, the SPIRE server has a pretty good idea of the identity of the hardware that this agent is running on, and it knows the particular key that is burned into this TPM. Same as before, it uses that information to issue an identity to the agent. In this case, the agent's identity is bound to the identity of this TPM and the hardware that it is running on. So that's as fast as I can tell the story of node attestation.
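The challenge-response at the heart of this can be sketched with a toy RSA key pair standing in for the key burned into the TPM. The parameters are deliberately tiny and must never be used for real cryptography; the point is only the shape of the protocol: the server encrypts a nonce to the public key, and only the holder of the matching private key can recover it.

```python
import secrets

# Toy RSA key pair standing in for the key burned into the TPM.
# Tiny parameters, for illustration only -- never for real use.
p, q, e = 61, 53, 17
n = p * q                           # public modulus
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent, "held by the TPM"

def server_issue_challenge():
    """Server: encrypt a random nonce to the public key from the TPM's certificate."""
    nonce = secrets.randbelow(n)
    return nonce, pow(nonce, e, n)  # (remembered secret, challenge ciphertext)

def tpm_solve_challenge(ciphertext):
    """TPM: decrypt with the private key that never leaves the hardware."""
    return pow(ciphertext, d, n)

nonce, challenge = server_issue_challenge()
answer = tpm_solve_challenge(challenge)
# Proof of possession: only the TPM could have recovered the nonce.
assert answer == nonce
print("TPM proved possession of the burned-in private key")
```

Real TPM attestation involves the manufacturer certificate chain and more machinery, but the proof-of-possession step follows this pattern.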
The end result here is that we've gone from an agent or a node that just comes up on the network with no prior knowledge to: okay, now we know exactly the identity of this hardware, of this agent that we're communicating with. So that solves the secret zero problem for the agent. But what about the workload? Andrew described that we solve this problem for the workload, too, via the Workload API. So there has to be some magic there. To solve that, we take a very similar kind of approach in the SPIRE agent that we call workload attestation. This is able to take a workload that we have no prior knowledge of, that has no credentials, and identify it. So how do we do that? Inside the SPIRE agent, we have a library called peer tracker. Peer tracker is a platform-specific implementation of some logic that is able to introspect the kernel the agent is running on to find out which process is calling it. So when the workload calls the Workload API, we do all these special lookups and are able to figure out the process ID and some other attributes associated with the process. Once we have this information, we pass the process info into these workload attestors, which are similar to node attestors. One big difference here is that the agent can load multiple workload attestors, and we fan out across all of them. So in this example, we have one for Unix that knows how to talk to the Linux kernel, one for Docker that knows how to talk to the Docker daemon, and one for Kubernetes that knows how to talk to the kubelet. We dispatch to each of these plugins, and they go and collect information about the process that's calling us and return what are called selectors. These selectors describe the calling process. In this case, we have a username, we have the SHA sum of the workload binary, and we have the Docker image ID, as well as Kubernetes-related information.
This is pretty much all we need in order to positively identify this workload. What is the shape of this workload exactly? What is its identity? Now we are in a position where we can issue it an SVID, a key, and the associated bundle. So we've spent a lot of time so far talking about the agent: how the agent gets an identity, and how the agent issues identity to workloads. But of course, there's the server component that Andrew mentioned earlier. It's got to manage these keys. It's got to actually mint these SVIDs. It's got a bunch of responsibilities on its shoulders. So I'll pass it back to Andrew to take a look at some of those internals. Thanks, Evan. All right. So again, the SPIRE server is the centralized signing authority for SVIDs inside the trust domain, and it accomplishes this by having a signing authority for each SVID type. Inside the SPIRE server, there's a distinct authority for X.509 and JWT SVIDs. Now, of these pairs here, the X.509 authority is represented as an X.509 CA certificate and its accompanying private key. This CA certificate can be self-signed; it can also belong to a larger existing PKI scheme and be signed by an authority inside that existing PKI. The JWT authority is just a simple asymmetric key pair. It's in charge of signing the JWT-SVIDs. One thing to note here is that across these two authorities, the private key material is not directly managed by the SPIRE server itself. This is a security consideration, to separate private key management from the SPIRE server itself and offload it to what is known as the key manager plugin. The key manager plugin is a simple interface that is more or less loosely based on a subset of PKCS#11. Through this interface, the SPIRE server manages multiple key slots for private keys, and can create these keys and use these key slots to sign arbitrary data. There are a couple of key manager plugins built into SPIRE. The top two you see there are memory and disk.
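Tying the workload attestation story together: here is a toy sketch of matching the selectors that the attestors discovered against registration entries to decide which identities a workload should receive. The entries and selector strings are invented for illustration; the matching rule sketched — an entry matches when all of its selectors are among the discovered ones — mirrors how SPIRE's selector matching is commonly described.

```python
# Toy matching of discovered workload selectors against registration entries.
# Selector strings follow the "type:key:value" shape from the talk, but all
# concrete values here are made up.

discovered = {
    "unix:user:webapp",
    "docker:image_id:sha256:abc123",
    "k8s:ns:payments",
}

registration_entries = [
    {"spiffe_id": "spiffe://example.org/payments/web",
     "selectors": {"unix:user:webapp", "k8s:ns:payments"}},
    {"spiffe_id": "spiffe://example.org/batch/worker",
     "selectors": {"unix:user:batch"}},
]

def matching_identities(discovered, entries):
    # An entry matches when every one of its selectors was discovered.
    return [e["spiffe_id"] for e in entries
            if e["selectors"] <= discovered]

print(matching_identities(discovered, registration_entries))
# ['spiffe://example.org/payments/web']
```

The second entry doesn't match because the workload is not running as the `batch` user, so only the payments identity would be issued.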
We touched on this earlier, when Augustine gave his update, that there are community efforts in place to develop a TPM-based key manager, as well as one that hits AWS's key management services. Getting back: we talked just a second ago about how the X.509 authority can be part of a larger existing PKI. The way that's accomplished inside SPIRE is through the use of an... Well, Evan, I think we just... Let's see. Yep. Sorry. Hold on. I got kicked out of my SSO session right in the middle of our presentation. One second. Is that better? I have my slide. I need to share. Hold on. Okay. Yep. Let's see. Virtual conferencing at its finest. Yes, yes. Sorry, everyone. Hey, it's real-time and it's interactive. Beats pre-recorded. Yeah, absolutely. Okay. Where's my mouse? Let's see. Okay. Is that better? All right. I'm back. Perfect. Thank you. All right. So the way that SPIRE accomplishes interacting with this sort of existing upstream PKI is through the upstream authority plugin. Again, this is a very simple interface that provides just enough functionality for SPIRE to interact with that upstream PKI for the two different authority types that it manages. Specifically, as an X.509 authority is prepared, the CSR for the intermediate CA certificate that SPIRE wants signed is sent upstream through the MintX509CA RPC, where it is signed by the upstream authority and then shipped back. And on the JWT side, as the JWT authority is prepared, the public key material is published upstream through the upstream authority's PublishJWTKey RPC. The idea here is that mechanisms inside that upstream PKI can then disseminate that key to interested parties who want to validate JWT-SVIDs minted by this particular SPIRE server instance. There are a whole bunch of upstream CA implementations here. I won't go over most of these, but I will mention the last one, the SPIRE upstream authority plugin.
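The key manager plugin's slot interface described a moment ago can be sketched loosely in Python. Real key managers hold asymmetric keys (possibly in a TPM or a cloud KMS); here an HMAC over a random in-memory key stands in for real signing, and the class, method, and slot names are invented for illustration.

```python
import hashlib
import hmac
import secrets

# Loose sketch of a key manager plugin's slot interface: named slots hold
# private keys, and signing happens through the slot rather than the server
# ever handling raw key material directly.

class MemoryKeyManager:
    def __init__(self):
        self._slots = {}

    def generate_key(self, slot_id):
        """Create (or replace) the private key held in a named slot."""
        self._slots[slot_id] = secrets.token_bytes(32)

    def sign(self, slot_id, data: bytes) -> bytes:
        """Sign arbitrary data with the key in the given slot.
        HMAC-SHA256 stands in here for asymmetric signing."""
        return hmac.new(self._slots[slot_id], data, hashlib.sha256).digest()

km = MemoryKeyManager()
km.generate_key("x509-ca-A")   # hypothetical slot for an X.509 CA key
km.generate_key("jwt-signer-A")  # hypothetical slot for a JWT signing key
sig = km.sign("x509-ca-A", b"data-to-sign")
print(len(sig))  # 32
```

A disk-backed or KMS-backed implementation would expose the same slot operations while changing only where the private key actually lives, which is the point of the plugin boundary.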
Evan's going to dive into the details of this later, but this is essentially where SPIRE acts as the upstream authority for a downstream SPIRE server. It enables some interesting resiliency and isolation benefits. So again, we've talked about how these authorities are prepared and may participate inside of an upstream PKI, but we haven't really talked about how the public key material from these authorities makes its way back to agents and down to the workloads via the Workload API. Essentially what happens is that SPIRE manages a storage backend, known as the data store, which is involved in storing all sorts of different things that we won't get into now. We're mostly concerned at this point with the trust bundle for the trust domain. As these authorities are prepared, the public key material is appended to the bundle inside the data store. And like we talked about way back when we first introduced the SPIRE agent, the agent is, at some frequency, connecting to the server and synchronizing down bundle material and getting SVIDs signed and rotated. As part of that synchronization process, it pulls the bundle from the data store through the SPIRE server. And as the X.509 and JWT authorities are rotated, the new public key material is appended to that bundle inside the data store, over and over, as SPIRE lives and breathes and carries out its rotation strategy. Again, those materials are periodically pulled from the data store, through the SPIRE server, down to the SPIRE agent, and out to the workloads. Now, this rotation of X.509 and JWT authorities happens at a configurable cadence, and we've seen how the public key material from those newly prepared authorities is stuffed inside the data store and eventually makes its way down to agents.
But because the agent is not getting a continuous stream of updates, and is instead polling at some frequency, there's an interval in which an X.509 authority has been prepared and its public key material has been published to the data store, but an agent has yet to poll for that key material. At that point, you can imagine that if the newly prepared authority immediately started minting SVIDs, and those SVIDs made their way to an agent that has yet to poll for that bundle update, the agent would be unable to verify those SVIDs. So SPIRE implements an interesting rotation strategy to mitigate this situation, and it accomplishes this by actually having two authorities per SVID type. So I lied a little bit earlier in the presentation. The first set of authorities is considered the active set, and the active set sits alongside the prepared set. The active set is the one involved in signing SVIDs: any authority in the active slot is the authority chosen to sign incoming SVID requests. And of course, its public key material exists in the data store and is propagated out to agents as they sync. At some point, when these authorities are going to retire somewhat soon, SPIRE decides it needs to rotate in a new authority key pair for both the X.509 and JWT authorities. But when it does so, it doesn't just replace the active authorities. Instead, it prepares a new set of authorities in advance and sticks those in the prepared slot. During this time, the public key material is added to the bundle, and the active slot is still the one in charge of minting these identities. So there's an interval of time here where our active slot is minting identities.
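The prepared/active slot arrangement can be sketched as a tiny state machine. Authority keys are plain strings here, the method names are invented, and the bundle is just the set of public halves that agents would sync down.

```python
# Sketch of the two-slot (prepared/active) authority scheme: a new authority's
# public key joins the bundle well before the authority ever signs anything,
# so agents have time to sync it down first.

class AuthoritySlots:
    def __init__(self, first_authority):
        self.active = first_authority
        self.prepared = None
        self.bundle = {first_authority}  # public material agents pull down

    def prepare(self, new_authority):
        """Create the next authority and publish its public key to the bundle
        in advance of it signing anything."""
        self.prepared = new_authority
        self.bundle.add(new_authority)

    def activate(self):
        """Swap the prepared authority in. The old key stays in the bundle for
        a while so already-issued SVIDs still validate."""
        self.active, self.prepared = self.prepared, None

    def sign(self, request):
        return f"{request}-signed-by-{self.active}"

slots = AuthoritySlots("ca-1")
slots.prepare("ca-2")
print(slots.sign("svid"))    # svid-signed-by-ca-1 (active slot still signs)
slots.activate()
print(slots.sign("svid"))    # svid-signed-by-ca-2
print(sorted(slots.bundle))  # ['ca-1', 'ca-2']  old key retained for validation
```

Note that between `prepare` and `activate`, everything is still signed by the old authority even though the new public key is already propagating, which is exactly the window the talk describes.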
And our prepared slot has been prepared, with its public key material placed in the bundle, where it's now propagating down to agents in advance of ever being used to sign SVIDs. After ample time has elapsed — enough to allow the public key material for the newly prepared authorities to propagate throughout the system and make its way all the way down through agents and into workloads — that active key pair can be retired. Now, the bundle material stays the same: we don't prune out the old authority key material quite yet. We want to leave it in there for a bit of time, because the activation step happens before that old authority has actually expired, so there could still be SVIDs floating around inside your system that were signed with that old, now-retired authority that you'd still need to validate. So we give it some time before we end up pruning that key material out of the bundle. And again, this process is repeated as SPIRE continuously monitors and rotates these authorities to maintain freshly rotated authority material. The cadence that we do this at follows a pretty simple strategy, based on how much time is left on the active authority. When the active authority has half of its lifetime left, that's when we go ahead and prepare a new authority. And when the active authority has one sixth of its lifetime left, that's when we go ahead and activate the new authority. And you can imagine the space of time between that halfway mark and that one-sixth mark: that is the time we give that prepared bundle material to propagate out to the trust domain before we start minting SVIDs with the newly prepared authority. There are some caps in there to prevent some weird timing with really long-lived authorities. I'm happy to talk about that, but I won't take the time right now. Let's see.
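The half-lifetime and one-sixth-lifetime thresholds just described can be written as a pure function over an authority's validity window. The caps for very long-lived authorities that the talk alludes to are deliberately omitted, and the 24-hour lifetime below is a made-up example.

```python
# The rotation cadence described above: prepare a new authority when half of
# the active authority's lifetime remains, activate it when one sixth remains.

def rotation_phase(not_before, not_after, now):
    lifetime = not_after - not_before
    remaining = not_after - now
    if remaining <= lifetime / 6:
        return "activate"   # switch signing over to the prepared authority
    if remaining <= lifetime / 2:
        return "prepare"    # mint the next authority, publish its public key
    return "steady"         # nothing to do yet

# A hypothetical 24-hour authority issued at t=0 (times in hours):
print(rotation_phase(0, 24, 6))   # steady:   18h of 24h remain (> 12h)
print(rotation_phase(0, 24, 13))  # prepare:  11h remain (<= 12h)
print(rotation_phase(0, 24, 21))  # activate:  3h remain (<= 4h)
```

The gap between the 12-hour and 4-hour marks in this example is the propagation window: eight hours for agents to sync the prepared public key before it signs anything.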
So we've talked a lot about SPIRE's responsibilities — the SPIRE server's responsibilities in particular. It's doing a lot of different things: it's signing SVIDs, it's rotating authorities, it's publishing material upstream. And it's obviously a big point of failure. If something goes wrong with the SPIRE server, that has large implications on our ability to push identity out to our workloads. So next — sorry, I'm messing around with slides here — Evan's going to take a minute to talk about some deployment strategies for SPIRE that try to mitigate those failure modes. Thanks, Andrew. I'm going to speed through this as fast as possible because we're already well over our allotted time here. So this is kind of the simplest deployment we can imagine. And as Andrew mentioned, the SPIRE server becoming unavailable is particularly problematic if all the workloads depend on having valid SVIDs to communicate with each other. The good news is that this is not the most terrible thing in the world if the SPIRE server were to fail. I mean, it's not ideal. But Andrew mentioned previously that the SPIRE agent does have a cache: the agent knows what identities it can issue, and it fetches those in advance and caches them. So if the SPIRE server goes away, we can't get new SVIDs and we can't rotate expiring SVIDs, but the agent can still perform workload attestation and can still serve SVIDs to workloads from its cache without contacting the server. So it's survivable in a steady state, but again, it's not ideal. A very simple approach to addressing this is to scale the SPIRE servers horizontally. You can have as many of them as you like, and this obviously addresses performance issues as well. We don't have any notion of active or passive; each server has the full authority.
They do have a shared data store, though. If you were to say, hey, I want to put one of these in each of several different failure domains — like one in each region, or one in each availability zone — having to stripe a data store across those things is not ideal. So another tool we have in our tool chest is what we call nested SPIRE. There have been a couple of mentions of it today. This is where SPIRE uses another SPIRE server as its upstream authority. Downstream SPIRE servers do node attestation and workload attestation the same as a regular workload does. So you could have, for instance, one SPIRE server in AWS and another SPIRE server in Azure, and both of them roll up to this global root level. This allows you to scale across these different failure domains and to manage the failure of different tiers of SPIRE servers. If this global tier were to go away for some reason, the local SPIRE servers can still perform signing operations and can still rotate workload SVIDs; they just cannot rotate their own signing keys, like their JWT keys or CA certificates. That's what we look to the upstream servers for. But you can imagine that if you have a one-week or two-week lifetime on those, you've got a significant amount of time to get that central cluster back up and running. The final tool we have in our tool chest is federation. This is where we have a different set of SPIRE servers in a completely different trust domain, with a completely different set of authorities, and the SPIRE servers exchange public keys between each other. This is good for managing failure domains. It's also good for managing security domains. If the SPIRE server in trust domain bar on the right-hand side here were to go down, it does not affect identity issuance in trust domain foo.
If it were to be compromised, it also doesn't affect the security of identity issuance in trust domain foo. So in summary, very quickly: we learned about all these major SPIRE code paths. These are the major workflows that are really important to understand how SPIRE works under the covers. We learned in particular about node and workload attestation with SPIRE — how we go from not knowing what anything is to knowing what things are. We learned about key management and rotation, how all of that is managed by the SPIRE server, and how the SPIRE agent receives those materials and figures out which workloads to give them to. And we also learned about some of SPIRE's failure modes and the different deployment patterns and techniques that you can use to mitigate some of the concerns that come up with this kind of technology. If you want to learn more, you can check out the SPIFFE website at spiffe.io. These are the two main GitHub repos, the SPIFFE and SPIRE GitHub repos. We also have our Slack channel. It's a very welcoming community, so we hope to see you there. I know we're already very much over time, so we'll take questions in the chat. I'll be there and Andrew will be there to answer them async. And we also have a session later today, a networking session, in which we can talk about anything that might have come to mind during this presentation. So thank you for bearing with us in this much-longer-than-planned talk, and we really hope that it's helped to set the stage for the rest of the day. Thank you, everyone. Yeah, thank you.