So thanks, thanks everyone for coming to the session, and to everyone who dialed in on video. We're going to talk today about SPIFFE and SPIRE. I hope some of you have heard about them; if not, I'll do a quick recap. But the focus of this presentation is more around how to get SPIRE from proof of concept to production. I've spent around the last five years, since like 2017, 2018, working on zero trust ideas, following the SPIFFE and SPIRE projects from the early days, and building different systems and products around zero trust using SPIFFE and SPIRE underneath. I spent a couple of years helping to rebuild an authentication and authorization system using SPIRE, and that is the world's largest deployment today, running and scaling beyond one million nodes. Since then, I started to think that this is very powerful technology, but it's a set of building blocks: you need to turn it into a product. So I started thinking about how to make this technology more accessible to people. And in 2020, together with a bunch of other smart folks, we wrote a book called Solving the Bottom Turtle. We'll have a SPIFFE and SPIRE booth this year downstairs. It's a truly open source, community-driven project, and thanks to the CNCF for sponsoring it. We'll have hard copies of the book, so stop by. I have three of them here, so if you have any questions, come find me, I'll give you a book, and you can probably find five or six other co-authors of the book during the conference and get signatures if you want. So what is SPIFFE? I hope everyone who came here has heard about it, but here's a quick recap. It's basically a set of standards. There are two projects, SPIFFE and SPIRE. SPIFFE is basically a bunch of markdowns that describe what a SPIFFE identity is and what a SPIFFE verifiable identity document is. You can think of a SPIFFE ID as basically a URI string that's baked into an identity document, which comes in two formats.
One is X.509 and the other one is JWT. The spec also describes things like what a SPIFFE trust bundle is and how you get it. And the biggest part of it is the workload API. The workload API is basically a set of APIs for how a workload or application can get an identity, refresh that identity, and verify that identity. So this is just a specification. Quickly, what is a SPIFFE ID and what does it look like? As I mentioned, it's just a URI string. It has a URI scheme, it has a trust domain part, and it has a path. It's pretty similar to a URL, obviously, but only certain parts are allowed; it doesn't come with other things like query parameters. You can put anything into the path, so it doesn't need to be just one word with the name of your microservice. You can build different schemas, for example, that incorporate information like which region this workload is running in, or which data center, et cetera. So it can be flexible, but there are certain limitations: because it goes into a JWT, for example, there will be certain limits on how much information you can put in there, and the same for X.509. So what's SPIRE? As I mentioned, SPIFFE is just a standard, and SPIRE is a production-ready implementation of that standard. It is an open source project and contains two main parts: one is a server and the other one is an agent. It implements the whole specification, and it contains a bunch of different plugins for attestation, between agent and SPIRE server, and between workload and agent. So basically, how it works: you put an agent on your nodes, then your agents connect to a server. There is a certain attestation mechanism between agent and server; for example, if you're running in a public cloud, you can use instance identity documents for verification of the agents. And then the agent performs attestation for workloads. So when a workload connects over the workload API to an agent to get an identity, the process basically goes to the workload API and asks, who am I?
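To make the SPIFFE ID structure just described concrete, here is a minimal sketch using only Go's standard library, not the official go-spiffe types. The trust domain and path are made-up examples, and it enforces only a few of the spec's rules (spiffe scheme, a trust domain, no port, userinfo, query, or fragment):

```go
package main

import (
	"fmt"
	"net/url"
)

// parseSPIFFEID splits a SPIFFE ID into its trust domain and path,
// checking a handful of the rules mentioned above. This is an
// illustrative helper, not part of SPIRE.
func parseSPIFFEID(id string) (trustDomain, path string, err error) {
	u, err := url.Parse(id)
	if err != nil {
		return "", "", err
	}
	switch {
	case u.Scheme != "spiffe":
		return "", "", fmt.Errorf("scheme must be spiffe, got %q", u.Scheme)
	case u.Host == "":
		return "", "", fmt.Errorf("missing trust domain")
	case u.Port() != "" || u.User != nil || u.RawQuery != "" || u.Fragment != "":
		return "", "", fmt.Errorf("SPIFFE IDs allow only a trust domain and a path")
	}
	return u.Host, u.Path, nil
}

func main() {
	// The path can encode more than a service name, e.g. a region.
	td, p, err := parseSPIFFEID("spiffe://example.org/us-east/billing")
	if err != nil {
		panic(err)
	}
	fmt.Println(td, p) // example.org /us-east/billing
}
```

In a real deployment you would use the `spiffeid` package from the go-spiffe library rather than hand-rolling validation like this.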
Then attestation kicks in and provides the workload with an identity that it can use to go anywhere. Because there are two different formats, you can use them to build mTLS, for example, or you can use JWTs for authentication and then hook them up with authorization systems that you build on top. So typically, when you kick off an "I want to use SPIRE" project, it goes through several stages. I define five stages here, but it could be three, because you can squash the first three stages, research, visibility, and proof of concept, together. What you do during this first stage is basically try to answer different questions: what is this? Do I really need it? I have a bunch of problems in my infrastructure and I want to solve them. Some of them could be: we need mTLS, or we need a better, kind of universal identity for everything, so we can use it together with authorization. The goal there is to figure out the set of use cases and priorities for your SPIRE rollout. Is mTLS the important thing? Do you want to use the SPIFFE identity for federation, to get the credentials or secrets that you store involved? You can use it for federation between different SPIFFE systems, or you can completely replace your static secrets, like API keys or AWS and GCP credentials, by doing a SPIRE identity exchange with your cloud service provider. And you can do this with any third party as well. Yeah, so this is just to help you get started. There's lots of information you can find about use cases and how you can use SPIFFE on the website, you can go and ask questions in the SPIFFE Slack, and I've also been building a collection of kind of very opinionated takes on different use cases for SPIFFE and SPIRE as well. So the next stage is when you've defined your priorities and you want to get internal buy-in from different parts of your company: you need to talk to security, SRE, et cetera.
And this is where you normally build some proof of concept to show how it will look as we get closer to production. And what we see is that lots of successful stories actually start with Kubernetes rather than with the long tail of legacy or serverless. What I call legacy is not true legacy; it's more like applications running on dedicated nodes. They could be really modern applications, but it's a different mechanism for how you do things like deployment, registration, and attestation of your agents and workloads. So Kubernetes is a really easy way to start, and a really easy way to prove out the different use cases and get internal buy-in. But when you get to this stage, you probably have lots of questions, right? It seems like the right thing to do, we love these use cases, we love the technology; how do we run it in production now? And running things in production means different things. If you go to developers, or to SRE, or to DevOps, or to security, you'll hear different answers. If you go to SRE, they will probably ask: is it scalable? How do you prove that there is no downtime? Do you even have a 24/7 on-call rotation? Because this is an authentication system, and it will be foundational for the rest of our applications to talk and authenticate to each other. So there are lots of questions, but we'll go through some stages, and my goal is not to answer all the questions, but to give you a mental framework for what you need to do, or think about, before putting this into production. The first thing you start with is understanding where your trust boundaries are and how they map to SPIFFE trust domains. There are different ways to think about it. You can have traditional per-environment trust boundaries, each with its own trust domain.
So you have a production environment, and that will be one SPIFFE trust domain, and your staging will be a different one; they're kind of independent systems. The next thing you want to do is think about how this maps to your PKI. If you're building independent systems, one for production, one for staging, one for development, you probably want a different PKI root for each, and then you need to think about the shape of your PKI: where do you store the keys? What's the TTL for all these keys? How do you rotate them, and how is all of this connected to the SPIRE servers? Federation is another interesting thing, because you can federate two independent SPIFFE systems with each other. For example, you can follow the pattern that Kubernetes uses, where you have one PKI per Kubernetes cluster: this is your blast radius, this is your trust domain, this is your SPIFFE trust domain. But if you have a SPIRE deployment per Kubernetes cluster, then you have many entities to manage. In order for them to talk to each other, say an application deployed on ten different clusters, you need to federate SPIFFE identities between all of them, and that affects your bundle size: you will have a lot of root keys, and the bundle size keeps growing. If you use JWTs, the number of keys there will also keep growing. All of this can affect your performance, and I think when you're going beyond ten entities, maybe you should start thinking about a different shape or a different architecture for SPIRE. The trust model also ties pretty closely to the investment, meaning how many things you need to build on your own in order to run SPIRE. Here I'm trying to give an idea: depending on what you trust, this is how much less or more you need to build.
So for example, if you want to run SPIRE with Kubernetes and you trust the Kubernetes control plane, you can use Kubernetes primitives like a DaemonSet for deployment of the SPIRE agent, you can use the SPIRE Controller Manager for registration of your workloads, and you can use Kubernetes projected service account tokens for attestation. If you do not trust the Kubernetes control plane, you'll need to build all of this yourself, right? And it all depends on how you deploy applications, whether it's one cluster or multiple of them. There are a couple more talks, by Uber, Tyler and Tyler, on how you can think about scheduler integration with SPIRE. As another example, if you deploy on a cloud service provider like AWS, GCP, or Azure and you do not trust it, it's probably pretty hard to do attestation. In that case, you could maybe rely on something like a vTPM for node attestation, but you'd need to do the research; I don't actually know if that's completely possible, but it could be one direction. Another thing you want to focus on is what SPIRE architecture you'll use for the deployment. It's a pretty big topic, and there's a pretty good overview talk, I think 30 or 40 minutes, so here's the link; I'll post the slides so you can click and find it, so don't worry about that. But basically, there are a few models that SPIRE supports. One is a single SPIRE server, or a cluster: obviously, for high availability, you want to have multiple instances of SPIRE running to make sure there is no downtime. The nested model is pretty good in cases where you want to run SPIRE using different attestation primitives, say for different clouds and for your on-prem deployment; you can run multiple of them. And federated is another model.
So you might run one SPIRE deployment in your on-prem environment and another one in a cloud when you're migrating there. You can federate these two systems either mutually, so your workloads running in the public cloud can talk to your workloads on-prem, or one-directionally: your on-prem workloads would be able to talk to the cloud, but not the other way around. Yeah, there's more information at that link; it's a great video. So, the trade-offs: I want you to think about different trade-offs when you're thinking about architecture and how to deploy SPIRE in production. Usually security and availability are the two trade-offs that come up; manageability is another one, and cost is another factor that's important to think about. A good example could be: we have ten Kubernetes clusters in production, and we run each cluster across different availability zones. To have SPIRE in highly available mode, you want to run at least one SPIRE server per availability zone, so you end up with at least three. Now multiply that by ten clusters and you get to at least 30 instances. That's for availability. Then you think about security: I probably don't want my production workloads, the applications, running on the same node as a SPIRE server, because if a workload gets compromised, that could potentially be a way for an attacker to get into the SPIRE server. So we want to run them on dedicated nodes, right? And now you're talking about cost, because now you have basically 30 nodes that only run SPIRE servers; you can't run anything else on them, and that could be pretty costly. If you find yourself in this situation, you probably want to rethink your architecture: do not run SPIRE inside the Kubernetes clusters, and run dedicated clusters of SPIRE servers instead.
In that case, if you talk about, say, East Coast and West Coast, two clusters with N+2 availability, you'll have three servers in each region. That would be six nodes compared to thirty. So there's always a balance between things. The datastore, when it comes to production, is another interesting piece. When we wrote the book, it was about these turtles all the way down: SPIRE is the system that provides identities to everything in your infrastructure, but it needs a database. So how do you authenticate to the database? You kind of get into this secure bootstrapping problem. One of the things that we've been doing, because we built everything internally, is what you could call a meta PKI: we had this kind of smaller turtle, a smaller infrastructure, that we used for authentication between the SPIRE server and the database. The biggest challenge is that with a database like Postgres or MySQL, people normally use a username and password, and how do you rotate them, right? Rotation in the SPIRE server world requires a server restart, and now you're talking about availability. And the database is usually the bottleneck if you have many, many nodes and workloads, if you're talking about more than tens of thousands of them. My bottom line would be: if you run on a cloud service, it's better to use its primitives for database authentication and use a managed database. If you run it on-prem and you're building the database yourself, it's a little bit more challenging, and this is where you need something like the meta PKI. There was a great talk by Matthew from Square about how they built and scaled the database on-prem, with much more detail than I can pack into 25 minutes. Performance is probably the question you will get most often from SRE, DevOps, and platform teams. It basically boils down to this: things are much better now than they were two years ago.
If your scale is less than 10K nodes and you don't have, I don't know, hundreds of thousands of microservices running, you probably shouldn't worry about it; you likely won't run into any challenges there. But there will be some links to additional resources on how you can think about the sizing of your SPIRE clusters, what node types, and how many resources you need depending on how many workloads you have. Monitoring and alerting: you'll hear this a lot from the SRE team. How do you know it's working? How do we know that it's not broken? What's going to happen if it's down? Do you have an on-call rotation for this? SPIRE provides a lot of telemetry for both servers and agents. There are no dashboards, though; you need to build them, and you need to understand what's important for you, because SPIRE is, as I mentioned before, a lot of building blocks. For example, anything related to the datastore could be datastore-specific, and you probably want more metrics from your datastore. But help wanted: if you want to contribute and help the community, this is one of the things the project is still looking for help with. And I want to quickly touch on logging and auditing. These are the kinds of systems we always tend to think about at the end. One important thing, though: you need to bring your own logging and auditing systems, and SPIRE can be plugged into them pretty easily. But one thing I want to point out here, specifically about auditing: the SPIRE server log contains a lot of information about agent attestation, but the agent logs contain the information about workload attestation, basically on what conditions a workload was given some identity. This information is very useful for detection and response purposes; it's very useful for, I don't know, some incident response for something that happened three months ago.
You probably want to preserve this attestation information for a longer time, like one or two years probably, but this is something that you'll need to build; it's not like there is just a log file in SPIRE. And the last thing I want to quickly emphasize is disaster scenarios and your comfort level. You probably want to start with a very high TTL for your identities, I would recommend like 30 days, before you get comfortable, and then build a plan for how you lower it. The big issue is: if the SPIRE server is down, then nothing is working, right? Everything that already has identities will be able to use them, but once they expire, you don't have authentication and everything is broken. So your level of comfort is basically how quickly you can rebuild the infrastructure. The way we thought about it is: if you burned down the SPIRE infrastructure right now, could you rebuild it in one hour? Then double that; that's our comfort level for how long we want to issue these identities for, so two hours in this case. Right, it gives us some room to breathe. Yeah, automate everything from start to finish, test everything, do different game day activities and chaos engineering. The big message here is just to make sure you're comfortable, and that you understand what's happening when something goes wrong, especially for monitoring and alerting. Yeah, so there are some links for additional resources on this slide. If you have any issues or questions, find me, or go to the spiffe.io Slack. And thank you, everyone, for coming here today. Okay, the question is whether SPIRE is for workload identification or for software identification, right? So the main goal of SPIRE is to provide identity to workloads and help verify those identities, and the identities come in two formats, X.509 and JWT tokens.
In the process of providing these identities, the SPIRE agent does a measurement, right? Before providing an identity, you need to know something about the workload: what user it's running under, if it's a UNIX process, or what container image it's using, for example, if you're running on Kubernetes. So it does this measurement, but it's different from software supply chain, for example. So thank you. Any other questions? I would suggest thinking about it this way: I don't remember what the default TTL is, probably one hour. I think that's too low, because if something goes wrong, you literally have a certain number of minutes to recover. If that's your level of comfort, that's great. I think we started with like seven days or something like that when we rolled out to production, or maybe even 30 days, before you get comfortable and the operational teams also trust you, right? Because remember, everything in your infrastructure depends on this. Cool, any other questions? All right, thank you. Thank you so much. Thank you for coming in.