Hello! Welcome to the KubeCon + CloudNativeCon North America 2020 virtual event. My name is Daniel Feldman. I am a software engineer at Hewlett Packard Enterprise, and I am so excited to be here today. The title of this session is No More Moats: Protecting Your Cloud-Native Infrastructure with Zero Trust.

Here's a preview of the topics we'll be discussing today. First of all, what is zero trust? Why would you want it to protect your cloud-native infrastructure? That will just take a few minutes. Then we'll discuss SPIFFE and SPIRE, which are the CNCF incubating cloud-native security projects that I work on every day and that help you implement zero trust on your infrastructure. Next, we'll discuss building systems with SPIFFE and SPIRE. This is where we'll go over some design patterns that we've learned from helping dozens of companies get started with zero trust on their infrastructure. Next, we'll be discussing my favorite topic, which is the roadmap for SPIFFE and SPIRE. This is where I get to talk about the features that I and my colleagues are working on to make SPIFFE and SPIRE better for everyone using cloud-native infrastructure to implement zero trust. Finally, hopefully by this point in the talk I'll have convinced you that zero trust is a great idea, and we'll discuss the next steps for getting started with SPIFFE and SPIRE zero trust security inside your organization.

Most organizations today use something called perimeter security. In perimeter security, you have a firewall: everything inside your firewall is trusted, and everything outside your firewall is untrusted. So you have a bunch of internal servers communicating with each other, usually over unencrypted connections. Then you have a DMZ, and that DMZ contains your email server, your web server, your VPN server, and those are locked down and only have encrypted inputs and outputs to the outside world.
And meanwhile, these internal servers can't accept connections from the outside world at all. Now, this architecture was great 10 or 15 years ago, and it evolved because most organizations started with no internet connection at all; they just wanted to connect their internal network to the internet, and they did it in the easiest way possible. But today, things are a lot more complicated. You have SaaS applications, you have business partners that need to access internal services, you have tons of stuff in the cloud, in multiple clouds. You might have some stuff in Azure, some stuff in AWS, some stuff in GCP, and different regions within all those clouds. You might have managed databases run by an external provider, and those are really hard to get inside your firewall. As we add all this stuff, perimeter security becomes increasingly untenable. It's really hard to maintain, and even if you do maintain it, it's insecure if you make a mistake. One of my colleagues, Frederick Kautz, likes to say we're protecting 21st-century infrastructure with 14th-century technology: we're building a moat around the network, like it's a medieval castle, so attackers can't get in. But that's just not the right way to do things for the future.

Zero trust means giving each service, wherever it runs, whether it's in the cloud, in your data center, or in a SaaS provider, its own unique, secure, provable identity. And that's what I work on, and that's what we're going to be talking about today. SPIFFE is the new open standard for how to implement a zero trust identity provider. It includes APIs, file formats, and written English descriptions of how a zero trust identity provider has to work, so that applications can talk to any conformant SPIFFE implementation. SPIRE is a production-ready implementation of SPIFFE. Both SPIFFE and SPIRE are independently CNCF incubating projects.
So you can go on GitHub, join meetings, comment on proposals, open issues, maybe resolve some issues. We encourage everyone to get involved in the SPIFFE and SPIRE open source projects.

What are the key benefits of zero trust? The first benefit, and the one that most people care about, is defense in depth. That means if one service is compromised, the attacker can't use that service to move laterally within your network. The second benefit of zero trust is reduced overhead for your security teams. Your security teams aren't manually maintaining that perimeter and constantly creating and rotating credentials for all your different services, because the zero trust identity provider is giving all those services their own unique identities. And the last benefit of zero trust is that services can't operate at all without an explicit identity, and you can then use that identity for observability and logging. This is really useful for bigger companies that might have thousands of different services and have lost track of which services are talking to which other services at which times, or maybe accidentally have some stuff in a dev environment talking to a production database. You really don't want that. So the observability and logging benefit really helps the bigger organizations.

Now that we've talked about the key benefits of zero trust, let's discuss SPIFFE and SPIRE, which are the CNCF-sponsored implementations of zero trust that I work on. Like I said before, SPIFFE is a standard for an application to use a service identity provider, and SPIRE is a production-ready implementation of SPIFFE that you can download and get started using. Both of them are open source projects in the incubation phase of the CNCF. The SPIFFE standard consists of four pieces. The first piece is the SPIFFE ID, which is a standard format for a service identifier. It looks like this: it starts with spiffe://.
Then there's a trust domain name, which is simply your company's unique name. Then there's another slash, and then there's a unique service identifier, which can be any string that makes sense for your company. So that's pretty simple.

The next part of the SPIFFE standard is a standard format for SPIFFE Verifiable Identity Documents, or SVIDs. An SVID is a cryptographically verifiable document asserting a specific SPIFFE ID. We actually support two different formats for these SVIDs. One is an X.509 certificate: just a standard X.509 certificate with the SPIFFE ID filled into a specific field, the URI Subject Alternative Name. The other format we support is JSON Web Tokens, or JWTs. That's another standard format for cryptographically verifiable identifiers, and it's really useful in certain situations where X.509 doesn't work. So we support both of those formats, but 90% of the time an SVID is in the X.509 SVID format.

The next part of the SPIFFE standard is the standard format for trust bundles, which are sets of public keys used to verify SVIDs. Each trust domain has a trust bundle, and that trust bundle contains the public keys that can be used to verify any SVIDs within that trust domain. So that's pretty simple. It's very similar to what your web browser, your operating system, or your phone has built in to verify servers: just a standard format for those public keys.

Finally, the last part of the SPIFFE standard is the workload API, and the workload API is the most complicated part of the standard. This is a local API that workloads can connect to in order to retrieve their own SPIFFE IDs, their own SVIDs, and their own trust bundles. When I say it's a local API, I mean it's a Unix domain socket, which is a facility provided by the Linux or Unix kernel that lets you create a server that sits there and accepts connections.
And when anyone connects to it, it has to be a local connection, and the kernel will provide the PID, the UID, and the GID of the connecting process. This lets any process connect to that workload API and get the right identity for that process, securely, based on the security of the operating system.

At this point, you might be thinking: this looks a whole lot like traditional PKI. I've got something that's basically a certificate, and I've got something that's basically a root certificate bundle. And you'd be right; we're using open standards here. But there's a key difference, which is that SVIDs and trust bundles rotate frequently, every couple of hours. In traditional PKI, if you're a bad guy and you compromise a certificate, that's really bad: you have basically unlimited access to the identity of that certificate, maybe for months or years. In SPIFFE, those SVIDs and those trust bundles rotate every couple of hours, so if you compromise something, the blast radius is very small. You only have a very limited window where you can actually access anything in the network.

Now, there is one consequence of this, which is that you have to use the workload API in order to get your SVID and your trust bundle. You can't just download the SVID, download the trust bundle, put them in a container or a VM, deploy them later, and use them for months or years like you could with a traditional certificate. Because these are rotating all the time, each service has to use the workload API to get its current SVID and its current trust bundle.

Now that we've talked about the SPIFFE standard for service identity, we can talk about the SPIRE implementation of that standard.
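As a quick aside, the SPIFFE ID format described earlier can be made concrete with a small sketch. This is illustrative Python, not part of SPIFFE or SPIRE; the `parse_spiffe_id` helper is hypothetical, and it checks only the basic spiffe://trust-domain/path shape, not the full SPIFFE specification.

```python
import re

# Minimal sketch of the SPIFFE ID shape described above:
#   spiffe://<trust-domain>/<workload-identifier>
SPIFFE_ID_PATTERN = re.compile(r"^spiffe://([^/]+)(/.+)$")

def parse_spiffe_id(spiffe_id: str):
    """Split a SPIFFE ID into (trust_domain, path).

    Raises ValueError if the string doesn't match the basic
    spiffe://<trust-domain>/<path> shape.
    """
    match = SPIFFE_ID_PATTERN.match(spiffe_id)
    if match is None:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    return match.group(1), match.group(2)

trust_domain, path = parse_spiffe_id("spiffe://example.org/billing/api")
print(trust_domain)  # example.org
print(path)          # /billing/api
```

In a real deployment you would rely on a SPIFFE library to validate IDs rather than a hand-rolled regex, but the structure is exactly what the standard describes: a scheme, a trust domain, and a service path.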
SPIRE consists of two components: the SPIRE server, which is responsible for generating and signing all the SVIDs in the entire system, and the SPIRE agent, which sits on each node and is responsible for serving the workload API.

When the SPIRE agent first starts up, it performs a task called node attestation. In node attestation, the agent proves the identity of the node to the SPIRE server. If the agent is running on an Amazon EC2 instance, it can collect the EC2 instance identity document and send it to the SPIRE server; that's a cryptographically signed document that proves the node is who it says it is. If it's physical hardware, maybe it gets a key from the trusted platform module of that hardware and uses that to prove its identity to the SPIRE server. If it's running in Kubernetes, Kubernetes provides a mechanism for proving that a particular pod is running on a particular node, and the agent uses that to prove its identity to the SPIRE server. Once the SPIRE server has that information, in some cases it needs to talk to the API server for whatever platform you're running on. For EC2 this isn't mandatory, but it can collect some extra information from the EC2 API. For Kubernetes, you actually do have to take a token that's returned from the local node APIs and send it to the Kubernetes API server through something called the TokenReview API. Then the SPIRE server can guarantee that the SPIRE agent is running on the node it says it's running on. Right now, we have implementations for a couple of different platforms: Amazon Web Services EC2, Microsoft Azure, Google Cloud, and Kubernetes, and we're working on more all the time.

Once node attestation is complete, the SPIRE agent has proven its identity to the SPIRE server, and the SPIRE server has an entry in its database indicating for sure that that SPIRE agent is running on the expected node. Once that's complete, we perform another task called workload attestation.
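Before moving on, the node-attestation exchange can be illustrated with a toy sketch. This is a stand-in, not SPIRE code: real platforms use asymmetric signatures (the EC2 instance identity document, for example, is signed by AWS), whereas this sketch uses a shared HMAC key purely to keep it self-contained. All names here are hypothetical; only the shape of the exchange matches what was just described: the agent presents a platform-signed document, and the server verifies it against material it already trusts.

```python
import hashlib
import hmac
import json

# Toy stand-in for a platform signing key. Real node attestation verifies
# an asymmetric signature from the cloud provider or a TPM.
PLATFORM_KEY = b"platform-secret-known-to-verifier"

def issue_identity_document(node_id: str) -> dict:
    """What the platform (e.g. a metadata service) hands the agent at startup."""
    payload = json.dumps({"node_id": node_id}).encode()
    sig = hmac.new(PLATFORM_KEY, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "signature": sig}

def server_verify(document: dict) -> str:
    """Server-side check: accept the node only if the signature verifies."""
    payload = document["payload"].encode()
    expected = hmac.new(PLATFORM_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, document["signature"]):
        raise PermissionError("node attestation failed")
    return json.loads(document["payload"])["node_id"]

doc = issue_identity_document("i-0abc123")  # agent collects this at startup
print(server_verify(doc))                   # i-0abc123
```

If the document is tampered with in transit, verification fails and the node never gets an agent identity, which is exactly the property node attestation is meant to provide.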
So first of all, the SPIRE server is constantly sending every SPIRE agent a list of all the workloads that are expected to run on that agent. Then, when a workload connects to the workload API (remember, it's a node-local API that requires no authentication), the SPIRE agent checks the details of the connecting workload against the list of expected workloads, gets the right SPIFFE ID, SVIDs, and trust bundles for that workload, and sends them over the workload API. And remember, this is changing constantly, rotating rapidly, so the agent is constantly pushing updates to the workload. Again, we support several platforms: we support Kubernetes, we support Docker, we support raw Linux processes, and we're working on more all the time.

So once you have one workload with its SVID, trust bundle, and SPIFFE ID, and you do the same process on a different node to give another workload its SVID, trust bundle, and SPIFFE ID, then they can communicate securely. Now, we don't specify how you communicate securely. Typically this would be a mutual TLS encrypted connection, but there are other ways of doing it that make more sense in different contexts. We're not a service mesh; we don't route the traffic between the two nodes. All we do is allow you to prove your identity if you're one service in a multi-service system. Now we have secure communication, which is the goal of a zero trust network.

Remember that we had specific platforms that we supported for node attestation and for workload attestation? It's important that SPIRE is a pluggable architecture. We have all these plug-in interfaces on both the agent side and the server side to let you add support for new platforms as they're developed.
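The workload-attestation matching just described can be sketched in a few lines. The entry and selector shapes below are illustrative, not SPIRE's exact schema: the agent learns facts about the connecting process from the kernel (its UID, PID, and so on), then looks for a registration entry whose selectors those facts satisfy.

```python
# Sketch of workload-attestation matching. The server pushes the agent a
# list of registration entries; when a workload connects over the Unix
# domain socket, the kernel reports its credentials, and the agent matches
# those discovered facts against each entry's selectors.
REGISTRATION_ENTRIES = [
    {"spiffe_id": "spiffe://example.org/web", "selectors": {"unix:uid": "1000"}},
    {"spiffe_id": "spiffe://example.org/db",  "selectors": {"unix:uid": "1001"}},
]

def attest_workload(discovered: dict) -> str:
    """Return the SPIFFE ID whose selectors are all satisfied by the
    facts discovered about the connecting process."""
    for entry in REGISTRATION_ENTRIES:
        if all(discovered.get(k) == v for k, v in entry["selectors"].items()):
            return entry["spiffe_id"]
    raise PermissionError("no registration entry matches this workload")

# The agent would learn these facts from the kernel, not from the caller:
print(attest_workload({"unix:uid": "1000", "unix:pid": "4242"}))
# spiffe://example.org/web
```

A process that matches no entry gets nothing, which is the zero trust default: no explicit registration, no identity.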
So we have multiple workload attestor plug-ins: some for the operating system, for different container runtimes, for the kubelet. You could add more there, especially as different container runtimes or different workload orchestration engines are developed. Both the SPIRE agent and the SPIRE server have node attestor plug-ins, and you need a matching node attestor plug-in on the agent side and the server side. These plug-ins can implement different cloud platforms, so if you wanted to support DigitalOcean, you could easily write a plug-in for that. And we have existing plug-ins, like I said, for Google Cloud Platform, Microsoft Azure, Amazon Web Services, etc.

Another type of plug-in that's very important is called an upstream authority plug-in. Every time the SPIRE server needs its own new root certificate, it uses the upstream authority plug-in, which can then talk to some external service to get that root certificate for SPIRE. And remember, the SPIRE server is rotating its root certificate frequently, every couple of hours, so it will make this call pretty often. It's common to use the upstream authority plug-in to get a root certificate from something that ties into your organization's existing PKI infrastructure, if your organization already has standards about who needs to sign certificates. Another option is using another SPIRE server as the upstream authority, so you can have a nested tree of SPIRE servers; we'll talk about that in a few minutes.

Another important design consideration for SPIRE that you need to understand is that it's really designed for defense in depth. If an attacker compromises one SPIRE agent, it can't issue identities that are intended for another SPIRE agent. This is very important because you might have 10,000 SPIRE agents throughout your infrastructure, and maybe one of them gets compromised because it's sitting on some edge node that is poorly protected.
But even if that compromised agent can still talk to the SPIRE server, it's not trusted to request just any SVID for the entire system; it has a very restricted list of SVIDs that it can request.

Now that we've talked about SPIFFE and SPIRE, we can start talking about practical ways to use SPIFFE and SPIRE within your applications. There are three main ways for a service to get its own SVID, trust bundle, and SPIFFE ID. The first way is by directly accessing the workload API. We've provided libraries for Java and Go that make that really easy, and we're working on a Python library. Your workload can just call a function to get its own SVID and then use that to establish a mutual TLS connection or present it to other services as proof of its identity.

The second way is for workloads that you can't easily modify, and that's to use an Envoy proxy. That Envoy proxy talks to the SPIRE agent, gets the SVID, and uses it for incoming and outgoing communication. Optionally in this configuration, you can use Open Policy Agent to make authorization decisions. This is actually a really useful design pattern because Open Policy Agent gives you a lot of flexibility over which SPIFFE IDs can talk to your workload, using a policy language called Rego. You can outsource all the responsibility for deciding who gets to talk to your workload to Open Policy Agent, which means your workload doesn't have to worry about it, and you can change it in configuration files instead of in code. This is actually the most common way right now to use SPIFFE identities within your infrastructure.

And then the last way is a little bit more rare, but it's really good for workloads that already provide their own authorization framework, in particular databases like Postgres. This is called the SPIFFE Helper, which is a process that we provide that talks to the SPIRE agent over the workload API.
It gets your SVID and trust bundle, and then it feeds them into the workload using a shell script that's really easy to write. Again, this is really useful for databases like Postgres that already have their own concept of users with different levels of access permission, because the SPIFFE Helper can map the SVIDs it gets to users with different permission levels within the database.

Now that we know the basics of using SPIFFE and SPIRE identities in your workloads, we can talk about some zero trust design patterns that I think everyone should know about in order to take the most advantage of SPIFFE and SPIRE in their environments.

The first design pattern is high availability. The SPIRE server is actually stateless: it stores everything in a data store, and we support several different types of data stores. If you use a high availability data store, then you can run multiple copies of the SPIRE server and multiple instances of that data store in an active-active, high availability configuration. This means that if SPIRE server one, data store one, SPIRE server two, or data store two crashes, or the instances they're running on get deleted, or all kinds of other things happen, SPIRE continues to run. This is really important because SPIRE needs to be highly available: if SPIRE goes down, then very quickly those SVIDs are going to start to expire, and your workloads will no longer be able to communicate with each other. So it's very important that the SPIRE servers be deployed in a high availability configuration. This is one design pattern that I think every organization looking at SPIFFE and SPIRE needs to think about. You can put these SPIRE servers in different availability zones, or on different racks if you're on physical infrastructure, so they're less likely to fail together and bring down your whole SPIFFE identity infrastructure.
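In configuration terms, the high-availability pattern mostly comes down to pointing every SPIRE server replica at the same replicated SQL data store. Here's a hedged sketch; the hostname and connection string are placeholders, so check the SPIRE server configuration reference for the exact options in your version.

```hcl
plugins {
    DataStore "sql" {
        plugin_data {
            database_type     = "postgres"
            # Placeholder: point every SPIRE server replica at the same
            # replicated, highly available PostgreSQL endpoint.
            connection_string = "dbname=spire user=spire host=postgres-ha.internal sslmode=require"
        }
    }
}
```

Because the server keeps no local state, any replica that can reach this data store can serve requests, which is what makes the active-active deployment work.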
The next design pattern I'd like to talk about is using separate trust domains for separate environments, with separate SPIRE servers. You can have a dev environment with one SPIRE server and a prod environment with a different SPIRE server. Then the workloads in the dev environment won't be able to talk to the prod environment workloads, because their SVIDs won't validate against each other. This is really useful when you have multiple environments that need to be completely isolated, even if they're running on multiple different clouds, multiple different on-prem servers, or multiple different Kubernetes clusters. It would be challenging to isolate them normally, but by using separate SPIRE servers you can have completely separate trust domains and completely isolated environments. This is one of the key advantages when you're just getting started with SPIFFE and SPIRE.

The next zero trust design pattern I'd like to talk about is something that our users are just getting started with right now. Say you have some CI/CD system, continuous integration and continuous delivery, that takes your source code. Every time you make a change to the source code, it updates a container image with a compiled version of that new source code and uploads it to some artifact repository or container registry. You can tie that into the SPIRE infrastructure: the hash of that container is tied to a specific SPIFFE ID, and then only that container will ever be able to get that SPIFFE ID. This is really useful for high security environments, because even if an attacker compromises the contents of the container later on and replaces some binary or some scripts, the container hash will have changed, so it will never be able to get the SPIFFE ID it's supposed to get. So this is really good for high security environments where you really want to tie a SPIFFE ID to a specific build hash.
It is challenging to set up, though, because you need some scripts in your CI/CD system that talk to the SPIRE server and configure the SPIFFE IDs and build hashes.

The next design pattern I'd like to talk about is using SPIFFE IDs to talk to a secret store. You probably already have a secret store in your infrastructure; HashiCorp Vault is one of the popular ones, although there are plenty of others. You can use your workload's SPIFFE ID to authenticate to the secret store, get a credential, and use that credential to access non-SPIFFE-aware servers. This is really useful if you want to gradually deploy SPIFFE and SPIRE throughout your infrastructure without a big bang that converts every service all at the same time. Then, gradually, you can replace those hard-coded credentials in the secret store with a zero trust pattern that uses SPIFFE authentication directly. So this is a really good way to gradually roll out SPIRE over time.

The next design pattern is in the same vein. In this design pattern you use something called OIDC federation. The SPIRE server can provide its trust bundle, its root certificates, in a format called OIDC federation, and this is supported by a couple of different external APIs. The big one is AWS: AWS allows you to map different OIDC identities to AWS IAM identities. The net result is that a workload running in your SPIRE environment can access the AWS API, is automatically granted an IAM identity, and can then use the AWS API for all kinds of different purposes, maybe accessing a database in Amazon RDS. The cool part is that you can have thousands of processes running on-prem that talk directly to the AWS API without ever having any fixed access token or secret key for AWS, which is really great for security and really great for managing those credentials.
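On the AWS side, this mapping is typically expressed as an IAM role trust policy that trusts the SPIRE server's OIDC provider and pins the token subject to a specific SPIFFE ID. The sketch below is illustrative only: the account ID, provider hostname, and SPIFFE ID are placeholders, and the exact condition keys should be checked against the AWS IAM documentation for OIDC federation.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.spire.example.org"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.spire.example.org:sub": "spiffe://example.org/billing/api"
        }
      }
    }
  ]
}
```

With a policy shaped like this, only the workload holding a token for that exact SPIFFE ID can assume the role, so no long-lived AWS access keys ever need to exist on-prem.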
So OIDC federation is really a key feature for a lot of organizations that are using AWS. Some of the other cloud providers are starting to support OIDC; they're not quite as mature yet, but they'll get there.

The next zero trust design pattern is called nested SPIRE, and nested SPIRE looks a lot like high availability at first. In this design pattern you've got a global SPIRE server, and then you've got multiple intermediary SPIRE servers that function as intermediate certificate authorities. Each of these intermediary SPIRE servers lower down repeatedly fetches its root certificate from the global SPIRE server every couple of hours, and then it talks to all the actual SPIRE agents that authenticate workloads. The advantage of this design pattern is that if the data center on the left goes down, it doesn't affect the data center on the right; and if the network connection between the data center on the left and the global SPIRE server gets really slow, it doesn't affect the data center on the right. This is really useful for separating multiple failure domains. We're working with companies that have thousands of servers in different data centers, and they really don't want a situation where one data center becoming slow or going down causes failures in other data centers. So nested SPIRE is really ideal for separating failure domains. Each of these SPIRE servers, especially the global server and each of the intermediary servers, should also be highly available itself, just to further reduce the chances that any one of them goes down.

The next design pattern I'd like to talk about is federation. This is SPIFFE federation; it's actually specified in the SPIFFE standard, so multiple different implementations of SPIFFE will be able to talk to each other.
And in this pattern, each SPIRE server has its own trust domain, its own set of agents, and its own set of SPIFFE IDs, but the servers are constantly exchanging trust bundles and feeding those trust bundles to all the SPIRE agents on their side. That means any workload running on the left side will be able to authenticate any workload running on the right side, and any workload on the right side will be able to authenticate any workload on the left side. What makes this different from nested SPIRE, and this is very important, is that these are still two independent SPIRE servers with separate trust domains and separate configuration. So this is really good when the system on the left is owned by one company and the system on the right is owned by another company, and they want to be able to communicate without trusting each other fully; or in some organizations you may have multiple divisions that need to communicate without fully trusting each other. Federation is really useful for separating security domains. It isn't designed to separate failure domains as much as nested SPIRE, and you still need high availability to make sure that nothing bad happens if one SPIRE server goes down, but it's really good for separating security concerns between one organization and another.

I'm a software engineer on SPIRE, not in sales and marketing, so I really like to talk about the roadmap of features we're working on for SPIRE. We are just about hitting our SPIRE 1.0 release with all the features I've discussed so far, but we have a lot on board for releases after 1.0.
The first item on the roadmap, and one of the ones that will improve SPIRE the most, is an improved data store layer. Right now we support using MySQL or PostgreSQL as that data store layer, where we store all the information about which nodes are attested, which SPIFFE IDs are available to be granted, that kind of thing. That's a little bit limiting, so we're working on an improved data store interface that will let you use all kinds of different distributed databases as the data store layer. That will improve our performance and production readiness across the board, so it's one of the things I'm most looking forward to.

The next thing is support for trusted platform modules. You might know this: HPE is the largest server company in the entire world, and every server we make has a trusted platform module, which is a secure chip that comes built in with a certificate signed by HPE. We should be able to use those TPMs as a source of truth for node attestation. Right now that feature isn't quite there, but we are working on it and it will be available soon. Then, when you buy a server from HPE, or any other company that provides a server with a trusted platform module, you'll be able to plug it in and perform node attestation right away.

The next feature we're working on is support for serverless functions. You might know this: in a lot of cloud infrastructure now, the glue is serverless functions. These things don't require an instance to run, they don't require a Kubernetes cluster, they're just built into the cloud environment. Amazon calls them Lambda functions, Google calls them Cloud Functions; every cloud provider has these. We're working on a really good mechanism for supporting SPIFFE IDs in serverless functions. That isn't there yet, because there's no agent and there's no obvious place to get the SPIFFE ID, so we're working on some glue to generate SPIFFE IDs and feed them into the serverless functions, so they can talk securely to other, traditional non-serverless services.

The next big feature we're working on is something called certificate transparency, and this is actually coming from the folks at ByteDance, the company that makes TikTok; they really want this feature. In certificate transparency, every certificate that's generated is added to a certificate log, and that log is built in a very clever way that makes it cryptographically verifiable what was added to it. So every single SVID that's been generated across your entire infrastructure, maybe across many different SPIRE servers and many different trust domains, will go into this auditable log, and then you'll be able to set up alarms so that if something goes wrong, if someone gets in and manages to generate fake, bad SVIDs, you find out right away.

The last thing is improved Kubernetes support. Right now we already have Kubernetes node attestation and workload attestation, like I mentioned. What we're working on is making it more automatic, and again, this is a contribution coming from outside HPE. We're working on a workload registrar that will go through your Kubernetes configuration and automatically do the relevant SPIRE configuration, so that every Kubernetes pod that starts up automatically gets its own SPIFFE ID that you can then use to establish zero trust.

So, I had actually been working on SPIRE for about two and a half years, and for virtually that entire time, whenever I talked about it to anyone, they always asked: who's using it? And I always had to say, well, no one yet. But at this point in 2020, we're getting to the point where a number of big companies are using it. First of all, we have some CNCF projects that are ingesting SPIRE wholesale and using it as the identity plane within a different infrastructure product; Kuma and Network Service Mesh fall into that category, and you can see other talks about those CNCF projects at this conference. Then, in terms of end users, there are a ton of big companies that are rolling SPIRE out across the
infrastructure because they're trying to go to zero trust: ByteDance, which makes the TikTok app, Uber, Square, GitHub, Bloomberg, Stripe, Anthem, TransferWise. Those companies have all given public talks on using SPIFFE and SPIRE that you can Google, and they're all linked from the SPIFFE website. There are other companies that won't talk about it publicly but are starting to roll out SPIFFE and SPIRE. Finally, the last end user is really cool: HPE also owns Cray supercomputers, which makes the fastest supercomputers in the world. In a couple of years, these exascale supercomputers with millions of cores will have SPIRE running on every single node in order to provide zero trust identity to supercomputing workloads, which is really cool. It's really cool that we've gotten this far in just a couple of years.

Lastly, I'd like to talk about the next steps for adopting zero trust with SPIFFE and SPIRE. First of all, we have a website with tons of documentation and links to every presentation we've ever done on SPIFFE and SPIRE that you can click and watch. We also did a community day as part of this conference, with people from a number of different companies talking about how they're using SPIFFE and SPIRE. We also have a very active Slack community; this is separate from the CNCF Slack and the Kubernetes Slack, its own Slack community with hundreds of people from a number of different companies talking about how to roll out zero trust security using SPIFFE and SPIRE. We've got our own Twitter, of course; follow us on Twitter. And we've got a book coming out soon. I actually worked on this book along with about 10 other people; it was published by HPE. If you shoot me an email, I'd be happy to send you a PDF of the book or even a paper copy. I think it's really good. My favorite part of the book is that it has five detailed case studies of how certain companies are using SPIFFE and SPIRE in real life and why they made the decision to go with SPIFFE and SPIRE. I think those case studies are incredibly valuable if you're thinking about making a case for SPIFFE and SPIRE to your management.

So with that, I'd like to thank everyone for watching. Again, my name is Daniel Feldman, I work in HPE security engineering, here's my email address, follow me on Twitter, shoot me an email to follow up, and I think we have a few minutes for questions.