Everybody, welcome. Thank you for coming today. Hopefully we're going to have a pretty good talk about some really fascinating, riveting stuff: NIST standards. But hopefully this will be some good material for everybody. So I kind of lied a little bit; the title is a little bit of a misnomer. I guess technically the standard is emerging, but it has emerged: it's been published, the final paper is out. And actually, we don't write standards at NIST, we write guidelines. Other folks write standards based on the guidelines. But this is now our published guidance for zero trust, and specifically SP 800-207A is focused on runtime zero trust.
So before I dive in, a little bit about me. I'm Zack, founding engineer at Tetrate. I was one of the early engineers on the Istio project, and I worked across Google Cloud before then on things like the project service, on stuff like identity and access management, on our service mesh, and then ultimately on Istio. The other big hat that I wear these days is doing research as well as writing standards, or guidelines, with NIST on microservice security, on zero trust, and on access control.
So with that, the first thing I want to equip you all with today is a working definition of zero trust. I think there's a lot of FUD in the space around what zero trust is or isn't, and so we want to get to the bottom of that and have a real concrete definition. As part of that, I'm going to introduce identity-based segmentation. That's the primary idea that we brought out in 207A. If there's one thing that you take away from the talk today, take away identity-based segmentation. Then we're going to talk about how we can start to move incrementally from where we are today into an identity-based model and maybe eventually realize identity-based segmentation. And then finally I'll spend a little bit of time talking about one of the ways that you can choose to implement this, which would be with a service mesh.
So first and foremost, let's start to define zero trust. I actually hate the name. Really, it should be about zero implicit trust, right? The problem is not trust. The problem is when it's not made explicit, so we don't necessarily know what access is happening, or we're not authenticating and authorizing it and doing a couple of other things I'll talk about. So what are some different ways to think about zero trust? One of the framings that I give is: if I were to pick any workload in your infrastructure and expose it to the internet, what's the impact of that? What's the damage that somebody could do? Another way of phrasing that is: assume a motivated attacker can already be in your network. I'll talk about why, but if there's an attacker in your network, how do you limit the damage that they can do? Two ways of thinking about the same fundamental idea.
And so let's start by thinking about how we handle that with the perimeter first, and then we can get to how we do it without the perimeter. The immediate thing that comes to mind is something like an API gateway or your front-end web service or something like that. And there we usually have a pretty routine flow. A request comes in, it's got some end user credential on it, maybe a JWT, maybe an API key, something like that. We take that credential, we authenticate it, we validate that it's allowed to do the action that it's attempting, so we authorize the action, and then we forward it on to the app. So if we're gonna pick workloads in our infrastructure to expose to the internet, or if we're assuming an attacker's inside, we probably wanna do a very similar kind of thing for our service-to-service communication.
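The routine gateway flow described above (authenticate the credential, authorize the action, then forward to the app) can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `API_KEYS` and `PERMISSIONS` tables, the `Request` type, and the function names are hypothetical, and a real gateway would verify a JWT signature or call a credential service rather than look keys up in a dictionary.

```python
from dataclasses import dataclass

# Hypothetical in-memory stores, for illustration only.
API_KEYS = {"key-alice": "alice", "key-bob": "bob"}
PERMISSIONS = {
    "alice": {("GET", "/orders"), ("POST", "/orders")},
    "bob": {("GET", "/orders")},
}


@dataclass
class Request:
    credential: str
    method: str
    path: str


def authenticate(credential):
    """Map a credential to a principal, or None if it is not recognized."""
    return API_KEYS.get(credential)


def authorize(principal, method, path):
    """Check that the authenticated principal may perform this action."""
    return (method, path) in PERMISSIONS.get(principal, set())


def app(principal, request):
    """Stand-in for the application behind the gateway."""
    return f"200 OK: {request.method} {request.path} as {principal}"


def gateway(request):
    """Authenticate, then authorize, then forward; reject otherwise."""
    principal = authenticate(request.credential)
    if principal is None:
        return "401 Unauthorized"
    if not authorize(principal, request.method, request.path):
        return "403 Forbidden"
    return app(principal, request)
```

The point of the sketch is the ordering: nothing reaches `app` unless both checks pass, and the same ordering is what the talk goes on to ask for between services.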
We want a runtime identity that we can authenticate, and we wanna be able to use that to authorize service-to-service access. And maybe as a nice-to-have, because SPIFFE is cool, or if we use something like that, we can represent that identity as a certificate, and we might get something like encryption in transit for free as part of doing that service identity. Or maybe you get the service identity for free as part of doing mTLS.
And then the final piece of the model that I wanna give you all is: why can the attacker be in the network, or why does this attack model matter? Any day of the week, we can pick a different example. This one happens to be about a decade old. It's a drawing from a government agency of the internal network of somebody I used to work for. So a motivated attacker, or maybe not an attacker, but somebody who's motivated, can definitely be inside the network. And really this, among other things, is what spawned zero trust as an idea and is why it's gained traction today.
So fundamentally, what do we need then? If the game is that an attacker can be inside the perimeter, what do we need to do? The answer is we need to bound the attacks in space and in time. We need to limit what can happen. We need to limit the blast radius. We need to limit how attackers are allowed to pivot inside our infrastructure, and we can use things like authentication and authorization to help do that. So with zero trust, we can't base our access decisions on the network perimeter. We need to base them on something else. What do we base them on? We want them to be least privilege. We want the access decision to happen per request. We want it to be based on the context of the request, and we want it to be based on identity. Maybe the one that makes the least sense there is context-based. So what do I mean when I say we want a context-based access decision?
Suppose that I'm looking at one of our corporate resources. I'm on an internal website or something like that, I'm hitting refresh, and I'm logged in, I'm authenticated. It's coming from Zack's device that Zack always accesses it from. Maybe it's even a trusted device, right? We're probably gonna allow a lot of access. Now, if I've been accessing from, say, the US, and five minutes later that same access with Zack's credentials comes from a different device in, say, Eastern Europe, that's probably not something that we want to allow a lot of access, right? It looks a little fishy. And so that's what we mean when we say something like context-based, and that's why we want context-based access, not just identities, not just doing it per request. There are other factors that we should evaluate.
And that leads us into identity-based segmentation. We also call this zero trust segmentation, but the idea is that we want to isolate workloads. That's what micro-segmentation does. We wanna do that with well-defined policies, and we want those policies to be based on cryptographically verifiable identities. We want those identities to include the user, the service, the device. They shouldn't be primarily based on something like the network location, the IP address, but certainly that can be a factor in something like a risk-based authentication and authorization system, right? The example I just gave with geolocation would probably be derived from something like the network IP address.
So what is identity-based segmentation? It's these five things, right? After this talk, you'll be able to count to zero trust on one hand. First, we want to encrypt in transit. We have two reasons for that. In general, we don't need to just encrypt everywhere because encryption's cool. There are two capabilities that encryption brings us. One is message authenticity. So I can be guaranteed that nobody's gonna tamper with or change the message that I sent.
The second capability is eavesdropping protection, right? So I know nobody else can see what I'm sending. So I can send sensitive data and I can be confident that it's not gonna be tampered with in transit. Then I need to identify and authenticate services. I need to use that authenticated service identity to authorize service-to-service communication. And I wanna do the same for the end user. I want to, at every hop, authenticate that end user and authorize that end user's access to resources. And we argue in 207A that if you're doing those five things, you're achieving a zero trust runtime, right? Again, if we pop the stack back to our working definition, we wanna mitigate what an attacker can do in space and in time. Or, the other way to think about it, we wanna be able to expose the workload to the internet. Doing these five things for any of your services helps mitigate what that attacker can do. We'll talk about that more.
So identity-based policy is interesting. It's something we probably want to move to because it gives us flexible policy, it gives us policy that a human can read much more easily, and it gives us policy that a human can edit much more easily and much more confidently than network-based policy. So how do we move there incrementally, right? We're not gonna just big-bang one day and go, hey, we're throwing out all the segmentation, we're throwing out all the firewalls, and we're just gonna have service mesh and mTLS and identity everywhere, right? That is not realistic for a variety of different reasons. And so again, we wanna go from today's world into maybe a better world where we're applying some of these zero trust principles. So first and foremost, policy at a single layer is problematic, right? Network policy doesn't really scale effectively to the kinds of dynamic environments that we have today.
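The five checks that make up identity-based segmentation, as listed earlier (encryption in transit, service authentication, service authorization, end user authentication, end user authorization), can be written down as a per-hop check. The following Python sketch is an illustration only, not how 207A or any mesh implements it: the `Hop` type, the two policy tables, and the identity names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hop:
    encrypted: bool                    # 1. is the connection encrypted in transit?
    service_identity: Optional[str]    # 2. authenticated calling service, or None
    user_identity: Optional[str]       # 4. authenticated end user, or None
    target_service: str
    resource: str


# Hypothetical policy tables, for illustration only.
SERVICE_POLICY = {("frontend", "orders")}   # 3. (caller, callee) pairs allowed
USER_POLICY = {("alice", "orders/42")}      # 5. (user, resource) pairs allowed


def check_hop(hop):
    """Apply the five checks in order; return (allowed, reason)."""
    if not hop.encrypted:
        return False, "not encrypted in transit"
    if hop.service_identity is None:
        return False, "service not authenticated"
    if (hop.service_identity, hop.target_service) not in SERVICE_POLICY:
        return False, "service not authorized"
    if hop.user_identity is None:
        return False, "end user not authenticated"
    if (hop.user_identity, hop.resource) not in USER_POLICY:
        return False, "end user not authorized"
    return True, "ok"
```

Running the same function at every hop, rather than only at the edge, is what bounds an attacker who is already inside: a stolen user credential alone fails check 2 or 3, and a compromised service alone fails check 4 or 5.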
When I say network policy, I'm not really talking about Kubernetes NetworkPolicy. I'm talking about things like firewalls, things like micro-segmentation, that last generation of technology that we used, right? And highly dynamic cloud environments are particularly problematic for these kinds of systems. At the same time, we're pretty early in the universe of identity-based policy in our industry. And there's a variety of problems that you start to run into with identity-based policy, things that we already hit in network policy and solved maybe 20 or 30 years ago, right? So for example, a lot of folks today, if you're writing an Istio authorization policy, are listing individual service accounts. That's kind of like writing a firewall rule with individual IP addresses, right? But today there aren't higher-level abstractions to describe a group of identities, or the name of a service across four different cloud providers with different ARNs and service accounts and similar, right? So that's a challenge with identity-based policy today. How do we get that uniform identity domain so that we can write consistent policy?
And then finally, we can't get rid of network policy entirely anyway, just because our regulators and our auditors still expect there to be network-level controls. So even if we thought that was a good idea, we can't do it today anyway. I don't necessarily think it's a good idea, by the way. I think you want some defense in depth. So we want at least two layers of policy when we start to talk about multi-tier policies. I say at least. In 207A, we were very intentional in keeping the scope very small. If you want to go read the SP, it's about 12 pages of content. And so as a result, there's a lot of stuff we didn't talk about. So think of this as very much a minimal definition. So certainly we want network-tier policies, things like firewall rules or segmentation or micro-segmentation.
We would like to move into a world of identity-based policy so that we're not basing it on network and perimeter and similar. But there are also plenty of other types to think about. One that we don't address in the SP is something like an application-tier policy, like doing WAF, or doing request payload validation, or request schema validation or similar. Those are all good things. You probably should do those things. But they're not in our minimum definition of zero trust. We're not saying don't do them. We're just saying this is a minimum. There's more you probably want to do.
Multi-tier policy, layering identity-based and network-based policies together, is realistic and non-disruptive. In particular, and I'll show you some examples in just a second, what we can start to do when we layer these systems together is relax some of the policy at the network layer that tends to kill agility, and augment that with policy at the identity layer that brings back some of the assertions that we lost when we relaxed the network policy. And because we can edit identity-based policy more confidently, read it more easily, and understand what it's going to do, we can have a much higher rate of change there than with traditional network rules. And we can do this in such a way that we're still exposing our traffic to all the traditional inspection points, if that's needed by our organization. I'll touch on that in just a second, too. So, like I said, the outcome here is that we can start to relax these lower-level policies as long as we're augmenting them with some of the higher-level policies.
The easiest example here is to look at a case that I hit all the time with folks that I talk to, which is: I have an app on-prem. I have something in cloud. Maybe it's an app. Maybe it's a third-party service. Maybe I want to use S3, whatever it is. How do we facilitate that connectivity?
In a lot of organizations, if I'm on-prem and I want to do that, there's at least one firewall rule change, and I go file a ticket and I wait six weeks and I get my policy updated, and now I can go continue to deploy my app and consume the new services and all that good stuff. It really kills the agility, and agility is why we're moving to cloud. That's why we're decomposing our monolith into microservices. The reason we're doing what we're doing today is to be able to deliver software faster to more users. So one of the ways that we can relax network policy and augment it with identity-based policy is doing something like inserting identity-based gateways on either side of our gap, on either side of our firewall. And there we can maintain one static firewall policy that says, hey, look, these two pools of identity-based proxies are allowed to communicate. And then we can augment that with identity-based policy controlling which applications are allowed to communicate over that bridge. And again, because we're in an identity world, these policies are easier to author and easier to maintain because they're human readable. Therefore, we can have higher confidence that they are correct, and we can author them and approve them more quickly. The net result is that we can have more operational agility, right? Now, when the net new app needs to come and consume some new service, we publish an identity-based policy that says the front end can call S3. Not some CIDR-based rule change where we need to go look at the database or the spreadsheet to see what it maps to.
And we can start to layer these, and we already have a layering of network policy today. When we talk about multi-tier policy, it's not like network policy is some monolithic thing, right? We already have multiple layers today, from coarse-grained segmentation like DMZ into the intranet, DMZ into the business zone and similar. And then we try and micro-segment on top of that, usually.
We might do that with subnets. We might even get down to the /32 if we're doing really, really fancy stuff. And then, of course, Kubernetes is yet again a different thing: it's its own virtual network, and we use CNI to manage network policy there. And what we would like is for identity-based policy to sit on top of all of that, right? And again, we can start to relax some of these lower-level ones and augment them with identity-based policy to help get to the right spot for our organization, to have the security concerns covered but facilitate developer agility.
So as we talk about multi-tier policies, there's a bunch of different stuff to look at. Multiple network policies can coexist, so, like I just mentioned, we can start to layer different technologies. And again, as we're relaxing these, we wanna tighten up identity-based policy as a trade-off. The whole point of 207A, in some sense, is to justify to your auditors, to your regulators, to your security staff internally that this is okay to do. The identity-based policies sit on top; they provide defense in depth. And what we're just about to talk about is that we can use a variety of different technologies to implement these controls. One of the big ones is, of course, the service mesh, and I'll talk about that in just a second. That's also the subject of a couple of our other NIST SPs that I recommend folks go check out.
So how can we start to implement this, right? Identity-based segmentation is cool. Maybe I've convinced you that you wanna start to look at it. If not, then as it becomes standardized, you will be convinced by auditors that you wanna go implement it. So what are the ways? First off, I wanna be very clear: service mesh is not the only way that we can implement this. And I'm about to talk about a few different ways that we can use a service mesh to implement some of these capabilities.
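The layering described above, one static network rule between two pools of identity-based gateways plus an identity-based rule per application, can be sketched in a few lines of Python. This is a toy model under stated assumptions: the CIDR ranges, the identity names, and the function names are hypothetical and do not come from the talk or from 207A.

```python
import ipaddress

# Tier 1 (network): one static rule that rarely changes. Hypothetical CIDRs.
ONPREM_GATEWAYS = ipaddress.ip_network("10.0.0.0/28")
CLOUD_GATEWAYS = ipaddress.ip_network("172.16.0.0/28")

# Tier 2 (identity): human-readable rules that can change often.
IDENTITY_POLICY = {("frontend", "s3")}


def network_allows(src_ip, dst_ip):
    """Only the two gateway pools may talk across the firewall."""
    return (ipaddress.ip_address(src_ip) in ONPREM_GATEWAYS
            and ipaddress.ip_address(dst_ip) in CLOUD_GATEWAYS)


def identity_allows(caller, callee):
    """Which application identities may use the bridge."""
    return (caller, callee) in IDENTITY_POLICY


def allowed(src_ip, dst_ip, caller, callee):
    """A request must pass both tiers (defense in depth)."""
    return network_allows(src_ip, dst_ip) and identity_allows(caller, callee)
```

Onboarding a new consumer in this model is a one-line identity rule, for example `IDENTITY_POLICY.add(("reports", "s3"))`, with no change to the network tier, which is the agility argument made above.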
The ones that I mention aren't even the only ways that we can implement it with a service mesh, either. There's a lot of flexibility in here, and that's by design. For those five things that make up identity-based segmentation, we are not prescriptive in the SP about how they happen, although we do use the service mesh as a reference implementation because we think it's a very effective way to implement the capability. And hopefully most folks are familiar with the service mesh. I'll give a five-second review real quick. In effect, it's a dedicated infrastructure layer that lets us monitor, secure, connect, and manage our services. And we can leverage it. By the way, there are a couple of different SPs that you can check out that cover microservice security with the service mesh: the 204 series. I also co-authored those. And we can apply some of the stuff that we talk about in the 204 series exactly to implement identity-based segmentation.
How do we implement the service mesh and what can it do? This slide's actually a little out of date. Nowadays, we don't necessarily only have a sidecar, and we're not necessarily exactly next to every application. We have things like Ambient that use a node-level proxy. We have things like Cilium that are moving even more stuff into the kernel. So across the board with service meshes, regardless of the specific architecture of the data plane, what we're doing is deploying a proxy near the application that intercepts all the traffic in and out, to allow us L7 control, to allow us to do encryption in transit, to allow us to apply per-request policy, and to get telemetry out. And critically, these are not net new capabilities. I would argue that to have any distributed system, you need some kind of service discovery and policy and operational telemetry. The service mesh is novel in making those centrally configured with declarative configuration.
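As a rough picture of what "a proxy near the application that intercepts all the traffic, applies per-request policy, and emits telemetry, driven by declarative configuration" means, here is a toy Python sketch. It is not how Istio, Envoy, or any real mesh is implemented: the `POLICY` shape, the `Proxy` class, and the service names are hypothetical, chosen only to show policy as data kept separate from application code.

```python
from collections import Counter

# Declarative policy as plain data, the kind of thing a control plane
# might distribute to every proxy. The shape is hypothetical.
POLICY = {
    "orders": {"allow_callers": ["frontend"], "allow_methods": ["GET"]},
}


class Proxy:
    """Sits in front of one application and sees every inbound request."""

    def __init__(self, service, handler, policy):
        self.service = service
        self.handler = handler
        # A service with no entry gets no rules, so everything is denied.
        self.rules = policy.get(service, {})
        self.telemetry = Counter()

    def handle(self, caller, method, path):
        ok = (caller in self.rules.get("allow_callers", [])
              and method in self.rules.get("allow_methods", []))
        # Telemetry is recorded for every request, allowed or not.
        self.telemetry["allowed" if ok else "denied"] += 1
        if not ok:
            return 403, "denied by policy"
        return 200, self.handler(method, path)


def orders_app(method, path):
    """The application itself contains no policy logic."""
    return f"orders handled {method} {path}"
```

Because the application function never sees a denied request and never reads `POLICY`, changing the policy data changes enforcement without redeploying the application, which is the property the talk attributes to central, declarative configuration.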
The mesh proxy becomes a universal policy enforcement point, or PEP. Because it's intercepting all of the traffic into and out of the application, we would even call it a reference monitor. If we go back to some of the 1970s research in access control and secure systems and similar, we get to the idea of reference monitors and the security kernel. That idea, the security kernel, is where we get the operating system kernel from. And if we think about what the operating system kernel does for applications that run on the same machine, the idea is that potentially the service mesh, with its distributed policy enforcement points near every application, can start to be the security kernel for our modern distributed systems. So all of our applications running in our distributed infrastructure can start to delegate capabilities to the mesh, in a similar way to how applications today offload things like file access and other baseline security things that we expect to the kernel.
And then finally, one of the other big things is that the service mesh itself enables cross-cutting change. We can have centralized control with distributed enforcement, and we can focus the pain of different capabilities on teams that are specialized in dealing with that. That's what I mean by cross-cutting. Say we wanna do encryption in transit. We need to integrate with our existing PKI, our existing certificate provider. There's a team that manages PKI in many organizations. They can be enabled to integrate the mesh with PKI one time and have that benefit across the fleet. And so we can use the service mesh exactly to bound attacks in space and time. We can use it to achieve runtime encryption. We can use it to authenticate and authorize service-to-service communication. And as we go back and look at our definition of identity-based segmentation, we can, again, do mTLS, and we can do service identity with things like SPIFFE.
We can do service-to-service authorization with capabilities like RBAC built into our system. We can do it with other, more sophisticated access control systems as well. And because the sidecar is a universal policy enforcement point, we can use it to enforce end user authentication and authorization on behalf of our application as well. The service mesh itself is not gonna authenticate your user, but we can guarantee that every request that comes in is gonna go to the OIDC server so that it can be authenticated. By the same token, it's not gonna authorize end user resource access. That's not something that's tractable to model in the service mesh. You probably have many users and potentially many, many, many resources. That's not something we typically write down in a YAML file or anything like that. Instead, we wanna use the service mesh as the policy enforcement point for our existing end user authorization system. We talk a lot about that in SP 800-204A and B.
So with that, we went pretty fast through this, but again, I want y'all to take away identity-based segmentation, five key things: encryption in transit, service authN, service authZ, end user authN, end user authZ. However it is that you do those things, we should start to do those at every hop through our infrastructure. And with that, we have quite a bit of time for questions. We have about 10 minutes or so for questions, comments, concerns, rebuttals. You can come up and heckle me, whatever y'all want. There's a mic over there and a mic over there as well.
Hi, thanks for the talk and thanks for writing the NIST special publication. How do you think about multi-cloud, or bridging on-prem and cloud? If you have an identity-based policy store, where does it live? How is it consistently enforced?
Yeah, so that's one of the open problems today.
When we talk about identity-based policy, how do I have a universal identity domain across my infrastructure so that I can actually write that policy and make sure it's enforced? And today, by and large, we see replication or we see a single point of failure, right? So, for example, I use Azure Active Directory, and that's my identity domain everywhere, and that's what we go to. That's a very common one to do today. That's one of the ways that we see that. Of course, there are super interesting projects like SPIFFE and SPIRE that are doing more things around identity federation, and other mechanisms for tackling that which are not a single source of truth, right? But then, fundamentally, when it comes to many different sites and many different clusters, we do need some kind of central coordination infrastructure. We actually talk about that in 207A and I didn't include it here, but there needs to be some kind of a system that's coordinating across the sites that you're deploying. That could be a CD system doing GitOps that's keeping things synchronized across your clusters. That could be an online management plane that's actively looking and pushing policy, right? But in some way, you need a central coordination system that has a source of truth that either gets pushed out and replicated or that everything goes back to.
Ooh, thanks.
Hey, thanks for the talk. Two questions. First one's a simple one. Is there some place I can find your slides?
Yeah, so they'll be posted with the recording of this as well, I believe, and I'll tweet a link as well.
Awesome, thank you. Second question. Sorry, I'm hearing myself echo. I assume you're intentionally kind of unopinionated about what specific technologies, specific frameworks you can use to achieve a zero trust architecture. I'm curious if you're willing to be a little opinionated.
Yeah, I mean like...
Are there some that you think get close or do it really well?
Yeah, I mean, definitely.
So obviously, I helped develop Istio. I still actively work in the Istio community. In 207A, we actually use Istio as the reference implementation that we talk about, right? The point of all these standards is to be general. I don't want to be prescriptive on technology, because otherwise it's not gonna be a durable, useful standard, right? And one thing I wanna point out is that even things like this can be done differently. I talk to folks all the time that do things like service identity in a JWT, right? And they handle encryption in a different way, or encryption is not needed for their risk profile, for example, and so they don't do that, right? So I don't wanna be overly prescriptive even in how we do this. But certainly, when I work with folks on this, we use Istio and that set of capabilities. We use Envoy as the sidecar to implement and enforce these policies.
Cool, thank you.
Yeah.
I can think of several, well, that is distracting. I can think of several use cases where there's no end user agent. So do you just need the first three, or?
Yeah, so usually in that case, we wanna use something like a service account to represent the job in our system. And that would likely be in addition to the actual runtime service identity, right? So the runtime service identity might be a Kubernetes service account. But I probably want some notion in my identity universe, a service account or some other identity, that represents the application itself, right? And so that would be the end user in this case. Just to give an example, I was there when we rolled this out in GCP. We do these five, among many other policy checks; we did this in Google Cloud. In Google's infrastructure at the time we had LOAS, which handled application identity. That's what SPIFFE is based on: Joe Beda created the SPIFFE spec originally based on LOAS.
And then what we did, as we rolled out the service mesh everywhere internally in Google, is we had that need for an end user. And what we mandated was that everybody has to go get a service account, a GCP service account. We have LOAS to do workload-to-workload, and the service account is what is authenticated and authorized to act on the system. And you want that to be in the same identity universe as the end user identity. That way audit happens. That way the logging of access by which systems, and all of that stuff that you want to do for end users, equally applies for internal automated systems.
Thank you.
Thanks.
It does seem in these implementations of service mesh and zero trust architectures that a lot is going on.
Yeah.
I'm wondering about the performance cost of implementing this type of technology, and if there are any known ways or guidance on implementing them more efficiently.
Yeah. So there's a lot of iteration on exactly that topic right now. In the service mesh space, we're seeing a lot of iteration on the data plane architecture. We see folks trying to pull it into the kernel or pull it into eBPF. We see folks like Istio Ambient trying to move some of it to the node and separate some functionality. So right now, I would say as an industry we're grappling with what is the right set of trade-offs across engineering, design, security, performance, and resource utilization that gives an acceptable cost for the right security profile. That's something we're actively iterating on today as an industry. There are a lot of different interesting technologies there. Obviously, eBPF is one that gets mentioned a lot. One that gets mentioned a lot less, that I think is interesting, is something like P4. We can actually do things like HTTP header parsing in a switch if you really wanted to.
And so a lot of the routing capabilities could even move into the fabric if you really wanted to accelerate it, right? So there's a spectrum there, I think, in terms of how we accelerate it, which parts we accelerate, and how much. And that's what we're playing with as an industry right now, whether that's something like Cilium and doing things in eBPF, or something like Ambient, moving some of it to the node, or similar. As far as the overhead itself, it depends largely on how much data you're sending, right? Because encryption winds up being the dominating cost. And so if you're not doing encryption, but you're doing all the other meshy things, then conservatively 10% is a decent benchmark number: 10% of the application's resource utilization. In practice, it winds up being quite a bit lower for a lot of systems we deal with.
Great, thank you.
Thanks.
I had a question about products, about when we need hardware acceleration. I think it kind of touches on the last question. So we set up a service mesh, it's great, and no one uses it because it's not fast enough. So we end up passing the cards directly into the containers, which kind of defeats everything. So do you know of any Kubernetes products that are moving in that direction, where you can get the hardware acceleration from the NIC to support the encryption and the speed and the checks and everything?
Yeah, so that's still relatively nascent, I think, but that is an active area. I was just talking with some folks about that earlier at this conference, even. So I think that's an active area where we're gonna see more. Personally, it's not one where I'm super qualified to answer today on where you can go use that stuff, like in a cloud provider.
Okay, thanks.
Sorry.
Earlier in the presentation, you mentioned context-based policies and networks and stuff like that. That seems to be in addition to identity-based segmentation. Where does that play in as far as cost, and do you have another spec or reference architecture that you would suggest?
So we don't. In the paper we expand a little bit more on what we mean by context-based, but in effect what we mean is that the access control decision that you're making per request likely needs to include factors other than the identity. Those factors are the context of that particular request. So maybe that's the IP 5-tuple. That's part of the context for the request, right? Or maybe, over time, I've been watching this user's access, and the accesses have come from a consistent geography or a consistent IP address, and suddenly it's different. That's part of the context. And so in general in this paper, we kind of sweep all of that under the rug of "you should authenticate the service, you should authenticate the user, and you should authorize their access." So we get to be a little hand-wavy there, and a lot of the devil in the details is: how do we do that authentication? How do we do that authorization, and what are the factors that we include in that risk-based assessment?
Thank you.
Thanks. All right, I think we have time for one last question, maybe two more.
Hi, if you run the workload in a secure enclave, you can have software measurements that are proved by the attestation quote. Any thoughts on using that as an ID that cannot be spoofed?
Correct, exactly. So that falls under that same category of "you should authenticate it." How should we authenticate it? Ideally, we want trust all the way from the bootstrap of the hardware, and the cloud providers do that and they give you some proof.
And ideally, you want to use that proof on up the stack to authenticate your service itself. Something like the SPIFFE Workload API starts to capture that kind of semantic. Again, that's actually part of the context that we want to evaluate. Are you in a secure enclave or not? That's part of a risk assessment of whether I authenticate and authorize this access in this context, right?
Cool, Wes.
Great talk as always, Zack. You keep teasing about NGAC, Next Generation Access Control. And I've been seeing a lot in this space lately. We talk a lot about ReBAC and the Zanzibar-based ecosystems. Is this a mature space, at least in the open source world, or is this still in NIST land?
Yeah, so NGAC is largely still in NIST land. We can maybe talk offline about ReBAC. I'm one of the few people in the world that actually implemented stuff on Zanzibar and implemented some serious IAM on the real Zanzibar, and I have a lot of very strong opinions about that system. So definitely come ask me about that. In my opinion, NGAC is a much more compelling system. It is also still a little bit of a toy. There are some real implementations around, but they're not open source yet. I hope there will be; there's some active work that NIST is doing around that with some universities right now, among other things, to harden it up. That is something that I would like to have happen. So certainly, folks that are interested in access control and in that space, please come find me. I have a lot of very strong opinions that I would love to share with folks around how we're doing that and what we're doing.
Thank you all. I appreciate it. Thanks for the excellent questions, everybody, and have a wonderful rest of your KubeCon. I'll be around to answer questions, and look for any of the Tetrate folks as well. They can help you, too, if you have any questions. Thank you all.