My name is Evan Gilman. Thank you for coming. Before we start, I figured I'd tell you a little bit about myself. I started my career in academia, doing network engineering, systems administration, network storage, all kinds of things. I've been bouncing around the Bay Area for the last five years or so, and I'm currently at a company called Scytale, where I've been focusing full time on SPIFFE and SPIRE, which we'll talk about today.

First we'll talk a little bit about cloud-native network security: what that landscape looks like, and how previous patterns fit onto current cloud-native architectures. We'll talk about SPIFFE, the problems it solves and what it aims to do, then a little bit about SPIRE from a high level, how it functions. To give you a clearer picture of that, we'll walk through a day in the life of a SPIRE deployment, so to speak. And time permitting, at the very end, I have a live demo prepared for you all.

If you're at this conference, you're probably aware that today's networks look a lot different than they used to, and as a result, the security requirements of those networks have changed a whole lot. Network resources come and go a lot more frequently than they used to, and a lot of systems support multi-tenancy. Previous network security patterns, particularly perimeter patterns, are really poorly suited for this; multi-tenancy and highly dynamic workloads throw a wrench into the whole approach we've been using for a while now. Another thing is that we don't have complete control over the network any longer, particularly in the public cloud: you may or may not have routing controls, depending on your provider. And at the same time, you still want to apply fine-grained policy, perhaps even more fine-grained than in the past. All of this comes together to say that the perimeter security model, in and of itself, hasn't aged particularly well, and it's certainly not able to solve for the cloud-native architectures we see today.

At the same time, software is eating the world. That's a quote from Marc Andreessen, a notable Silicon Valley investor, and it's definitely true for the cloud-native network security space: we see security controls moving up into software, out of the hardware and out of the network where they have lived for literally decades. There are lots of really great projects aimed at exactly this kind of network security automation. But when you solve that problem, you have to choose an identifier. You need a way to describe the things you're writing policy against, right? Most compute and cloud providers have their own identifiers; a security group is the classic example. The only problem is that these aren't well generalized. They're specific to that cloud provider. If you ever need to move providers, or you ever turn up anything on-prem, your security system will fall apart if you've based it on cloud-specific primitives. Some folks do it, and it's okay if you'll never, ever move, but it's kind of ill-advised to start this way.

So going up one level, we have IP addresses.
With IP addresses, we're getting a little closer to what we're looking for, because they're fairly well generalized; they apply across the board. Most projects today that provide the network security automation we're talking about use IP-based ACLs. And even when the IPs aren't exposed as a first-class concept, they're still there under the covers: you may use one abstraction, but beneath a level of indirection there are iptables rules and the like, with layer 3 and layer 4 policy instantiated underneath. While this is largely a good thing, it doesn't come without its pains. For instance, the ACLs you build can grow quite large, thousands or tens of thousands of rules, particularly if you're doing host-based routes (/32s) plus layer 4 matching on ports, so each individual host may have tens of rules applied to it. It's also dependent on network topology: if you have NAT anywhere, how do you know which IP address translates where? And particularly in cloud, a lot of this stuff is out of your control. There's not a whole lot you can do about it.
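To make the scale problem concrete, here's the kind of rule set that gets instantiated underneath these systems. This is purely illustrative; the addresses and ports are made up:

```sh
# Layer 3/4 policy: one /32 host route per peer, one rule per port.
# Multiply by every peer and every service, and rule counts explode.
iptables -A INPUT -s 10.0.1.17/32 -p tcp --dport 3306 -j ACCEPT
iptables -A INPUT -s 10.0.1.18/32 -p tcp --dport 3306 -j ACCEPT
iptables -A INPUT -s 10.0.1.17/32 -p tcp --dport 9090 -j ACCEPT
# ...and so on, often to thousands of rules cluster-wide
```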
So on that note: if you don't control the network, how easy is it to trust that network? Because what we're talking about here, after all, is a security control, and the IP guarantee you'd build that control around is a really, really weak guarantee. Manipulation is practically undetectable by either endpoint; if the IP address is changed in the middle, or something has happened along the path, you don't really know. So it's hard to accurately describe things in security policy using IP addresses, and you basically have to trust the network to do the right thing.

Putting that problem aside for a moment, I do think we should think more critically about the identifiers we've chosen. When we use these layer 3 and layer 4 identifiers, what are we actually modeling? Realistically, we're modeling host-to-host interaction. But is that what we were really intending to do? When thinking about policy, you generally write it as "I want A to talk to B". But A and B are practically always software services, and many services can run on the same host. Kubernetes does an okay job of approximating this by giving one IP address per pod, and again we're getting closer to what we're looking for, but it's still a fairly weak assertion: one IP address can represent many things, even in Kubernetes, since lots of things running in a pod share the same address. We really want to be more granular. What we're actually trying to describe is not host-to-host communication but process-to-process communication, and yet the whole time we've been using IP-based ACLs to do it. The point I'm trying to drive home is that IP addresses are really, really bad at describing particular workloads. It's just not the right tool.

Going back to the identifiers available to us, we can see there's one missing. How do you identify a software process, particularly in this brave new world of highly dynamic, heterogeneous systems? This is known as the workload identity problem, and there are lots of folks working on it. Within Kubernetes, SIG-Auth's Container Identity Working Group is working on this problem. The Istio folks are working on this problem. And of course, myself and the rest of the Scytale folks are working on this problem. That's just within the cloud-native umbrella; we're all working together, but there are folks outside this umbrella working on it as well.

There's one nuanced point about the direction so far: the solutions we're coming up with are largely dependent on the underlying orchestration systems. What we end up with is different software platforms with different ideas about what workload identity is. The workload identity you get from Kubernetes is fundamentally different from the one you'd get under DC/OS, for instance, and it can be really, really difficult to reconcile the differences between these identities. There are tools that try, but again, they're very focused on their particular deployments, and as a result none of them interoperate with each other. Furthermore, none of them are super robust, because at the end of the day they're tangential to the projects that host them: Kubernetes is looking to schedule containers, not necessarily to mint certificates. So the result is not only interoperability problems, but also that these mechanisms rarely meet the user's needs; they're not very flexible, just a necessary evil that had to be written into the project.

What we really need to solve this problem is what we call universal workload identity: a workload identity that can be understood by everyone. And with it you can solve not only authentication and policy problems, but other problems as well: tracing, metering, auditing all become a whole lot easier if you have a stable identity paradigm that applies across different software stacks, and you can trace requests as they traverse different ecosystems. So it's really interesting and promising.

And here's SPIFFE. This is the problem that SPIFFE solves: workload identity that is standardized, on a level playing field, that works across different software stacks. At its core, SPIFFE is just a set of open specifications, and these specifications are community driven. Scytale didn't wake up one day and say "we need to write a spec for this thing"; this is a bunch of folks from many, many companies who came together and reached consensus that this is the way we think generalized identity should look, act, and breathe. These specifications address a number of things, but primarily the issuance, the validation, and the interoperability of workload identity. We do this by first defining a standard way to prove workload identity, and second by providing a standard way to both obtain and validate these identities.

There are many specifications under the SPIFFE umbrella; today we're going to talk about three of them. The first one outlines what we call a SPIFFE ID. It's modeled as a URI, which is basically just a structured string, and the spec isn't too prescriptive about how you compose it.
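For example, a SPIFFE ID looks like this (the trust domain and path here are made-up stand-ins):

```
spiffe://example.org/op/wiki
│        │           └─ path: free-form, chosen by you
│        └─ trust domain: the root of trust the ID lives under
└─ scheme: always "spiffe"
```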
You can choose any name or path you wish, and the ID acts mostly as a primary key. It can be thought of as a username, but for workloads rather than humans. It's not very interesting in and of itself, but it provides us a really flexible starting point.

Next, we introduce this thing called the SPIFFE Verifiable Identity Document, or SVID for short. This is how we prove identity to a remote system. When we talk about a document that can fulfill this role, what we're really looking for is something with a proof-of-possession property. And luckily, lots of existing document types have the proof-of-possession property, so we don't need to reinvent the wheel on this one. As a result, SPIFFE uses the X.509 certificate as the underlying document type for the SVID. We looked at many different document types in this process; X.509 made a lot of sense to start with for a lot of reasons. First, it's widely deployed. Second, its behavior is widely understood. And finally, it's a model that just works well for the SPIFFE problem space in general. So what these specifications really describe is how to encode a SPIFFE ID into an X.509 certificate, specifically which fields to use, et cetera, and how to validate the certificate and the SPIFFE ID inside it.

With SPIFFE ID and SVID defined, we can begin to address this universal identity problem, right? We have an identity format and a document that are generalized and cross-cutting; they can be used anywhere, regardless of the underlying software platform. But there's still one missing piece: how does workload identity get issued in the first place? Where does this SVID thing come from? To answer that, SPIFFE has another specification called the Workload API. The Workload API is a standardized, node-local interface. When a workload first starts, it contacts this API; the API does some work to identify the caller and then returns the correct SVIDs to it. It also returns trust bundles at the same time, so the workload knows which other things it should be trusting. And at this point, the workload is ready to serve traffic, using the SVID it's just been issued.

Standardizing not just the SVID but also the Workload API is a super valuable thing, because it ensures that identity is available in a platform-agnostic way. It doesn't matter if you're running on metal or AWS or GCP or whatever; a SPIFFE-compliant workload should be able to run anywhere. You pick it up and drop it anywhere, and this identity will be available. This paves the way for truly heterogeneous infrastructure.

So clearly, SPIFFE compliance is a pretty powerful thing. But you might be asking: okay, now what? What exactly do I need to do to SPIFFE-ify this thing? What exposes this Workload API? How do these SVIDs get minted and distributed to where they need to be? This is where SPIRE comes in. SPIRE is an open-source project that implements the SPIFFE standards; namely, it exposes this Workload API and provides a framework for managing the issuance of these SVIDs. And I should note that, as I mentioned before, SPIFFE is only a set of specifications, so anyone can make a SPIFFE implementation. As an example, Istio implements SPIFFE. SPIRE is just one implementation.
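To give you a feel for what talking to the Workload API looks like in practice, SPIRE's agent ships a small debugging client that speaks the API over the agent's local Unix socket. Roughly like this; the socket path is an assumption and the flags may differ by version:

```sh
# Ask the node-local Workload API for our X.509 SVID and trust bundle.
# Note: no credentials are presented; the agent identifies the caller
# out-of-band, as described later.
spire-agent api fetch x509 \
    -socketPath /tmp/agent.sock \
    -write /tmp/certs    # writes the SVID, key, and bundle to disk
```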
Unlike some other implementations, SPIRE is a standalone project focused purely on this problem of universal identity. And it's fairly thoughtfully designed, with an extensive plugin system that allows it to operate in all these different environments, even if you have something custom or in-house. Of course, automatically issuing certificates is fairly tricky business, so SPIRE also necessarily solves the problem of secure introduction and trust bootstrapping, which we'll talk about a little later.

At a high level, SPIRE comprises two core components: the agent and the server. The agent is responsible for exposing the Workload API on every host. It's also responsible for what we call workload attestation: verifying the authenticity of the caller and making sure it is who we think it is. The SPIRE server is responsible for issuance, the minting of these certificates. It's responsible for node attestation, measuring the authenticity of new nodes coming online, as well as for registering identity mappings, basically saying who should get which SPIFFE ID.

To demonstrate how this works, I think it's good to step through a worked example. The first step, of course, is to deploy the SPIRE server. It can run pretty much anywhere; it has no special requirements of its own. On first boot, it generates a self-signed certificate, which we'll use to sign SVIDs for all the workloads managed under this particular server. We also have a plugin called the upstream CA plugin, so if you have an existing PKI system, we can integrate with that as well through this plugin, but it is not required.

At the same time, the server turns on its registration API. The registration API is used to configure these identity mappings: who gets what, essentially. In order to make such a mapping, we need to be able to describe a workload and a node in a particular way, and we do so using properties we call selectors, natural properties of the nodes and workloads we're trying to describe. As an example, here's one registration mapping. It says that in Kubernetes cluster foo, in the namespace operations and under the service account mediawiki, workloads with this particular Docker image ID will get the SPIFFE ID op/wiki. And since this is an API, it's really easy to automate: this can be a human putting the information in, or you can have automation systems curating these mappings and updating them as things change in your environment.
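Expressed against SPIRE's registration CLI, that mapping would look roughly like this. The trust domain, parent ID, and exact selector names here are stand-ins for whatever your deployment actually uses:

```sh
# "In namespace operations, under service account mediawiki, workloads
# running this Docker image get the SPIFFE ID op/wiki."
spire-server entry create \
    -parentID spiffe://example.org/k8s-node \
    -spiffeID spiffe://example.org/op/wiki \
    -selector k8s:ns:operations \
    -selector k8s:sa:mediawiki \
    -selector docker:image_id:<image-id>
```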
As soon as the server is up, it's ready to take on agents. For this part of the example I kind of have to choose a provider, so I chose AWS; the server doesn't necessarily have to run in AWS, as I mentioned before. Let's say we turn up an EC2 VM. The first thing to start on that VM will generally be a SPIRE agent, and the first thing the agent does is perform what we call node attestation. Simply stated, this means strongly proving "I'm a valid AWS server, I belong to you, I'm authorized to be here." In order to do this, you have to leverage what we call a trusted third party, and in this case, our trusted third party is actually AWS. The agent exercises its node attestor plugin, which is another plugin, and this plugin is platform aware: it knows how to speak to AWS and gather the necessary proof that it is the machine it claims to be. In the AWS case, this proof is probably an instance identity document. The agent then presents this instance identity document to the SPIRE server, using regular server-verified TLS. If there's hardware involved, this might be a TPM quote instead of an IID, so you can see how this is fairly flexible; you can swap in different types of documents here.

Once the server receives this proof, it has to validate it in some way, right? In our case, this means calling the AWS API. Additional checks can be done at this time, too: is it a fresh boot? Was the NIC or the hard drive attached to it recently? Depending on your risk profile, you can look up all these different properties and make all these different checks. Once AWS acknowledges that the instance identity document is a valid one, and all of your custom checks pass, an SVID finally gets issued to the agent. This SVID represents the identity of the agent itself, and everything that happens after node attestation has completed occurs over mutual TLS, using this agent identity as the client certificate.

At this point, the SPIRE agent is fully bootstrapped, and the next thing it does is turn on the Workload API. The Workload API is accessible by any process on the node, and crucially, perhaps a little counter-intuitively, it's actually unauthenticated. This is a really important property for bootstrapping trust, because otherwise we'd have to somehow inject credentials into the workload, and it turns out that that's kind of the hard part. But we obviously still want some control; we don't want to give certificates out to any willy-nilly process that asks for them on this socket. So instead of direct authentication, the agent performs out-of-band process interrogation, which enables it to identify and verify workload authenticity without the use of direct credentials.

For instance, let's say a workload calls the agent socket and requests an SVID. On Linux, the agent will interrogate the kernel for information about the caller. First it figures out the process ID of the thing calling it, and from there it can discover many, many other properties: the UID, the GID, even the SHA of the binary that's running. The properties discovered at this time are the selectors we saw in that earlier mapping, and they're returned to us by another plugin we call the workload attestor plugin. The agent supports multiple workload attestor plugins, which can be mixed and matched. When a call comes in, we fan out across all these attestor plugins, and they return all these selectors, properties describing the thing that called us. Once we get all the selectors back, we can consult the identity mapping in the server and say: okay, we now know this is the SVID you should be issued, or, you are not authorized to have this particular SVID. The selectors I mentioned before, UID, GID, et cetera, come from the Unix attestor; those are Unix primitives we're using to describe the process.
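As a rough illustration of what the Unix attestor can dig up: on Linux, the agent learns the caller's PID from the kernel via the socket's peer credentials, and from the PID it can recover the rest. The commands below are a hypothetical approximation of that interrogation, not SPIRE's actual code path:

```sh
PID=4242                                  # caller's PID, learned from the kernel
grep -E '^(Uid|Gid):' /proc/$PID/status   # UID and GID of the calling process
readlink /proc/$PID/exe                   # path to the running binary
sha256sum "$(readlink /proc/$PID/exe)"    # digest of the binary itself
```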
But if you're using Kubernetes, for instance, we'd have a Kubernetes attestor instead. That attestor speaks to the kubelet: first it validates that the workload is in fact a Kubernetes workload, and from there it can discover all sorts of Kubernetes-specific information about it. In the example earlier you saw its namespace, its service account, and other things like that.

Going back to the idea of a trusted third party: we used AWS as the trusted third party for the node attestation piece, and we used the Linux kernel as the trusted third party for the workload attestation piece. This pattern, using a trusted third party to establish trust between previously untrusted entities, addresses a problem known as secure introduction. Secure introduction is a famously challenging problem; most folks don't even know the problem exists, to be honest. And this is one of the reasons I love SPIRE so much: it takes these really, really hard problems and makes them easy to understand and easy to approach. Thinking back over the examples we just walked through, you can see that SPIRE solved secure introduction for the server using AWS, and for the workload using the Linux kernel. This is absolutely necessary if you ever want to automate the secure issuance of identity; it's really, really difficult to issue identity securely in an automated fashion without a pattern that looks a lot like this one. And SPIRE does it in a fairly generalized way: it's really easy to unlock support for additional platforms by writing these small plugins that just slot in.

So, I did a lot of talking, and I realize it's sometimes hard to hold all of this in your head and visualize it from slides alone. So I actually have a demo prepared to show you all. Let's hope I can change my screen configuration now to make this thing work. Okay. Right here I'm in the spiffe-example repo. Is that clear enough for you in the back? Can you all see this okay? No? Is that better? Okay. This is the spiffe-example repo, where we have all the demo code checked in, so you can actually go and run this demo yourself. We'll be working out of this repo today.

Here's a diagram of what this demo looks like. I'm sorry, in the back it's probably a little fuzzy, but I'll walk through it. This whole demo runs under Vagrant, essentially, and there are three virtual machines. The top-right green box is the Kubernetes master VM; it runs the Kubernetes API server, and it will also run our SPIRE server. On the bottom right we have a database VM. This Vagrant VM is meant to be an analog of a bare-metal host, and it's running MariaDB. On the left, the green one is the Kubernetes node, running a kubelet, and it's running a pod with this blog/forum sort of application inside it. You can see we're using ghostunnel here; ghostunnel is providing mutually authenticated TLS between the database and the blog application. You might also notice this little "sidecar" box. It's a little bit poorly named, but it's just a small bit of code that hits the Workload API, pulls out the identity, and shoves it into ghostunnel. We haven't yet taught ghostunnel to talk to that API directly, so eventually this piece will go away; we just need to get around to it.

So without further ado, let's see. Running make harness will bring up these VMs and a multiplexed tmux session. It's fairly well automated, actually.
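If you want to follow along at home, the setup is roughly this (repo URL as of the time of this talk; the Makefile targets may have changed since):

```sh
git clone https://github.com/spiffe/spiffe-example
cd spiffe-example
make harness    # brings up the three Vagrant VMs in a tmux session
```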
Dave, I think he's in the back, did a pretty good job on this. What you're seeing here is those three VMs; each horizontal row is a single VM. The top row is the Kubernetes master, running the API server; we'll run the SPIRE server there. The middle row is the Kubernetes node, a worker running the kubelet. And the bottom row is the database host.

The first thing we need to do is obviously start the SPIRE server. We don't have a lot of screen real estate here, so I hope this is okay, we'll see. So: spire-server run. The SPIRE server is now running.

Now that the server is running, we need to start the first agent, and we'll start it on the database host. We talked about trusted-third-party attestation and all that, but we don't really have a great way to perform automated attestation in a Vagrant environment. So rather than automating the provider-based attestation, we'll use a human as the trusted third party here. We'll tell the server to generate us a join token, similar to the way Docker Swarm and some other projects work, and at the same time we'll specify a SPIFFE ID for this host. Okay, here's the token; we'll copy it over and get the agent running. I'm also going to kick it into debug mode so we can see a little better what's going on. You can see the agent has started and picked up this SPIFFE ID of db-node that we gave it; it's kind of arbitrary, just something to get the demo going.

So now that's running. On the same machine, over here on the right-hand side, we have a script that starts that little sidecar code, which grabs the certs, and then also starts ghostunnel. I'll run it real quick just to show you what happens. You can see that it's actually panicked, and the output says zero bundles. That's because we haven't registered this particular workload with the SPIRE server yet; the server doesn't know about this thing, and it's not entitled to any identity just yet. So we need to register it so it can get its certs, and we'll use the Unix attestor to describe this particular one. Because it's running under my user account here, we can see that I am UID 1000, so we'll use that to describe this particular process. Let's go ahead and register it: we declare a parent ID of the database host, which is where it should be running; we define the selector as that UID; and finally, we issue a SPIFFE ID, which we'll call database.

Okay, you can see that it's been registered here; there's a unique ID for it and everything like that. And you can also see on the other side that the agent has now picked up this new identity that it can issue. So now if we go back and run the script again, you can see it did not panic: we got a key, we got a certificate, and now ghostunnel is running. If I look for ghostunnel here real quick, we can see that we've configured it to only allow the SPIFFE ID "blog" to access this particular database. So this is mutually authenticated TLS, and this side is now running. Next we need to do the same thing and start the agent on the Kubernetes host.
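For reference, the database-host steps we just ran look roughly like this. The trust domain and exact flags are approximations of the demo, not verbatim:

```sh
# On the server: mint a join token bound to a SPIFFE ID for the host
spire-server token generate -spiffeID spiffe://example.org/db-node

# On the database host: start the agent with that token, in debug mode
spire-agent run -joinToken <token> -logLevel DEBUG

# Back on the server: register the sidecar by its Unix UID
spire-server entry create \
    -parentID spiffe://example.org/db-node \
    -spiffeID spiffe://example.org/database \
    -selector unix:uid:1000
```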
So we'll generate another token, except this time we'll call it kube-node. And I should note that you can join all the hosts in your Kubernetes cluster in this same way, so they all have a similar SPIFFE ID and you can do the mapping appropriately. We'll do debug again. Okay, you can see this looks a little bit different than the last one: these logs keep rolling over, right? The reason this is happening is that, as we saw, the sidecar code just panics if there's no bundle available. So for the purposes of this demo, we've thrown that sidecar code into a while-true loop inside the pod; it's just waiting for an identity to become available. We haven't registered it yet, so there is no identity for it; that's why it rolls over like this.

So we need to register something for it. And this time we don't necessarily want to register it using Unix primitives; we want to describe it in ways that Kubernetes understands. So we'll register this thing with the kube node as the parent, and we'll use a Kubernetes namespace selector. But a namespace is not very granular, and we want to get a little more granular than that, so we'll also specify a service account as a selector. Finally, we'll give it a SPIFFE ID, which we'll call blog. Okay. Now you can see this has been registered, and this time it has two different selectors. These are compound selectors: both of them must be true in order for the workload to be issued this SPIFFE ID. You'll further notice that the log has stopped rolling over, so the identity has actually been issued, and it's been picked up by ghostunnel. And if we look for it over here, oopsies, we should see it. There it is. It's running, and similarly, it's verifying that the SPIFFE ID of the remote end is the database SPIFFE ID.

So now that these tunnels are both up, we should be ready to go. Going back to our situation here, if I open this up, with any luck we'll be able to see a forum application. There it is. It's always tense, these little moments in between. You browse around, everything works okay. And going back to the terminal here, just to do our due diligence, we can prove that this is actually traversing the TLS tunnel: we kill -9 this proxy, you can see on the bottom that the server side has recognized the connection closed, and if we refresh the page, 500. So that's automatic issuance of identity that is understood both by Kubernetes and by bare metal. And you can imagine that if you scope these lifetimes down, you can really, really easily rotate these credentials without problem.

So going back to my deck here, one second, juggle my displays again. Okay. This is almost the end, so to summarize: we explored the value of universal workload identity, what you can get with it, and why you probably need it. We learned about the SPIFFE specs themselves, what they do, and how they solve this universal identity problem. We also learned about SPIRE, the problem it solves and how it works, and we saw a cute little demo of it in action.

Looking forward, we have a lot of exciting things planned for the SPIFFE and SPIRE projects. One thing we're looking at is introducing a new SVID document type based on JWT instead of X.509. It would not replace X.509; it would augment it.
Another thing we'd like to do, and are beginning work on, is extending the Workload API to provide what we call a Handshaker endpoint, which will allow a workload to do mutually authenticated TLS with a remote workload without ever knowing its own private key. It does this by offloading the TLS handshake onto this API endpoint. And perhaps the most exciting thing we have coming up: there's currently an open proposal for SPIFFE to be adopted by the CNCF as an official CNCF project. We're really excited about that one.

If you're interested in any of this stuff at all, we'd really love to hear from you. We have a super active community of over 150 people from 40 different companies, including Google, Square, and Heptio, all actively involved in the project. We have regular special interest group meetings as well, which are open to the public, so you're more than welcome to join. We also have a bunch of GitHub repos. The first one, the spiffe/spiffe repo, is where the specs themselves are housed, and we also have all the community information there, so if you want to join the SIGs or anything like that, this is the repo to go to. The second, the spire repo, is where both the agent and server code bases live, so you can go check out how all that code works and how we do this magic. Finally, the third one is spiffe-example; this is the one we ran the demo out of today. As you saw, it's fairly well automated, so you can visit this repo and run this demo at home. Our Slack is also public, and we're there pretty much all the time, every day, so if you ever have any questions or comments, we'd love to hear from you on our Slack channel. And last but not least, we have a drink-up tonight, and I apologize for the crazy, obnoxious link here; that's my bad. But if you'd like to come, we would love to have you.

So this is a really exciting time for SPIFFE. There's a lot of stuff going on, and we think this is a really, really important problem. Please do keep an eye out for these projects, because we're only just getting started. Thank you, everyone.