Okay, so multi-service without a mesh. I'm sure some of you are here because you really don't like service meshes. This talk is not actually going to tear down service meshes, but hopefully we'll get a little deeper understanding of why people are using service meshes, and how you can solve the same problems without one, and then you can decide: do I need a service mesh? Is that the right tool for this job? Or can I take two or three components and get the pieces that I want? It's also a little bit of a call to make things better. One of the things I've been describing to people as the journey of building this talk is discovering that we're really bad at software, so I'm going to be calling out a couple of places where we really should be doing better.

So, hey look, I've built multiple microservices, wired together. This is awesome. It's a nice simple architecture diagram: I've got a front-end service, I've got a couple of back-end services, they each have their own databases, they're nicely factored, they talk to each other through APIs, and I'm not ready for a service mesh yet. Can I just slap some Kubernetes on it and make it work? Well, we've got a DB operator, and everything else is just out-of-the-box Kubernetes. It's awesome. We're done. Okay, great, go home. Oh, wait, we're at a security talk. So this is the part where I start to make you sad, and we will keep getting sadder over the course of the talk, and at the end, we will go to another talk and repeat the process. I don't know why you all are here.

To start with, it's not too bad. Our users are connecting to us over the internet, and we've got ingress encryption. I hear that's an important thing. I remember Zach talking about that, the whole TLS-everywhere push because the NSA is listening on all your wires, and all that stuff. But it's okay: we've got cert-manager, we've got Let's Encrypt, we can just go and connect things. That's assuming I'm using Ingress or Gateway API; if I'm using Kong's APIs or Traefik's APIs or Contour's APIs or Istio's APIs, then I need to do some manual wiring together. It's not great, but do we really expect cert-manager to support 15 different ingresses? No. Gateway API maybe will give you those capabilities later in the year. I don't know. So, it's okay. Not too bad.

Great. So we've encrypted the traffic over the internet. Now we're secure. We're safe. All the bad guys can connect to us through TLS, and nobody on the internet can sniff the bad traffic they send us. Oh. Well, maybe we need something like an API gateway to enforce stuff like rate limits, maybe API keys, some form of authentication. So we don't just want an ingress; we want something that has these capabilities. Some of the ingresses out there do have these capabilities, some of them up to a certain point, and sometimes you need to go beyond that, and you actually want an API gateway and not just an ingress or a service mesh.

Great. Now we just send all our traffic over the cluster, and our CNI lets us do layer-4 policies, maybe. There's no feedback on your network policies as to whether or not they're being enforced. You just say, I want this policy, and maybe the traffic is blocked, and maybe it isn't. But that's okay. Some CNIs also support encryption. How do you tell if that's happening? Well, you sneak onto the cluster node, so you're not in a pod, and you do a packet capture, and you see whether or not the traffic's encrypted.
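To make that concrete, here's roughly what creating one of those layer-4 policies from Go with client-go looks like. This is a minimal sketch; the `shop` namespace and the `frontend`/`orders` labels are invented for illustration.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	tcp := corev1.ProtocolTCP
	port := intstr.FromInt(8080)

	// Only pods labeled app=frontend may reach app=orders, and only on TCP 8080.
	policy := &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "orders-ingress", Namespace: "shop"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "orders"},
			},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "frontend"},
					},
				}},
				Ports: []networkingv1.NetworkPolicyPort{{Protocol: &tcp, Port: &port}},
			}},
		},
	}

	// The create succeeds whether or not the CNI will enforce any of this.
	// There is no status field to check afterwards.
	if _, err := client.NetworkingV1().NetworkPolicies("shop").
		Create(context.Background(), policy, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Notice what's missing: there's nothing you can read back to learn whether the policy is actually being enforced.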
Yeah, there's no interface for asking Kubernetes: is my CNI encrypting traffic? Is my CNI enforcing network policy? No. You just build your application for Kubernetes, and sometimes it's secured, and sometimes it's not. You really just ought to know your Kubernetes provider well enough to know the answers, and every developer should know which Kubernetes they're deploying to, and every Helm chart author should know which Kubernetes they're deploying to. Yeah.

We may also want to actually be able to restrict access. So in this case (oops, went too far) the front end should be able to talk to the orders and addresses back ends. And maybe orders needs to be able to read addresses, but the orders service should never be updating someone's address. That's not right. So we actually need to establish identities for these things. You've heard a lot about zero trust, and I'm not going to tell you whether these need to be zero trust or not, but we need some form of identity that the applications can actually reason about.

Service accounts: awesome. You can take a service account token, it's mounted in your pod, send it to someone else, and they can verify that you are who you say you are by going to the API server and asking, hey, is this a valid service account token? They can also go to the API server and do anything you could do. Do not send those tokens to other people. They let the other person impersonate you. Fortunately, I'm going to say a year, a year and a half or so ago, the service account token projection API showed up, which lets you say: hey, I want a token for my service account identity, but with a different audience. I don't want it to be believed by the API server; I want it to have this other audience string. Then your destination says, I'm looking for the audience named orders, so you get yourself a token whose audience says orders, you send it over, and that's great. We've kind of solved this with some OpenID machinery and so forth.

But now we're sending these over the wire, and maybe our CNI is encrypting and maybe it isn't, and anyone who grabs one or sees one along the way can reuse it. So why don't we just stop worrying so much about whether or not the CNI is encrypting, and set up a cluster-local CA? That's been a standard enterprise solution for, I don't know, 20 years or so: you make a self-signed cert and you use it to sign certs for the DNS names that aren't public to the internet. Remember, all these names are service.cluster.local. If you go to Let's Encrypt and ask for a cert for service.cluster.local and they give it to you, that's going to be real exciting: you get to publish a blog post about a vulnerability you found in Let's Encrypt, and it shows up in the Certificate Transparency log so everyone can see it happened. It's not going to happen. So you need a self-signed CA cert, which is great.

Your other option is that you could use SPIFFE: roll out SPIFFE, teach all your systems to use it, and manage all of that. It looks like you can pick any language you want, as long as it's Go, Java, or Rust. So if you were using TypeScript, now you get to learn Rust and how to interact with it, or maybe Go or Java, but I'm guessing Rust is going to be the best of the three. In any case, it might not be what you were thinking of when you said, I'll write a TypeScript front end so that I can write TypeScript on the server and TypeScript in the browser.
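For the Go case, here's a minimal sketch of what that looks like with the go-spiffe v2 library: fetch an identity from the SPIFFE Workload API and dial another service over mTLS. The trust domain and service names are invented for the example, and this assumes something like a SPIRE agent is already serving the Workload API socket.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx := context.Background()

	// Fetches (and auto-rotates) this workload's X.509 SVID and trust bundle
	// from the Workload API.
	source, err := workloadapi.NewX509Source(ctx)
	if err != nil {
		panic(err)
	}
	defer source.Close()

	// Only accept a server that presents exactly this SPIFFE ID.
	serverID := spiffeid.RequireFromString("spiffe://cluster.local/ns/shop/sa/orders")
	tlsConfig := tlsconfig.MTLSClientConfig(source, source, tlsconfig.AuthorizeID(serverID))

	client := &http.Client{Transport: &http.Transport{TLSClientConfig: tlsConfig}}
	resp, err := client.Get("https://orders.shop.svc.cluster.local:8443/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```

And that's the rub: if your service isn't written in one of the blessed languages, you get to build or wrap something like this yourself.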
In any case, we've now said: oh, identity problems are hard. I know, we'll simplify it to a key distribution problem. Now we just need to get this CA cert to all the different places. Mounting the certificate and saying, okay, I'll serve this certificate: not too bad. Getting a CA cert trusted inside an arbitrary Docker container: not great. Go is not too bad; you set SSL_CERT_DIR to say "all my certs are here," and it will just pick them up and load them. OpenSSL is not too bad either; you set SSL_CERT_DIR, but then you need to take all the certs and hash their names into the special format OpenSSL wants, so you end up with a directory full of names that are eight hex characters dot zero, each a symlink to your cert. You could do that hashing someplace else, or in the container, but it needs to happen before your binary starts, so you can't just take a container that uses OpenSSL and magic a cert into it without a bunch of extra work. In Java, there's a different, Java-specific tool you need to use to load things into the trust store. There may be other TLS libraries that have different ways of doing this; I couldn't find any other widely used ones. But "here is my private CA, please trust it" has been a problem for 20 years, and we apparently still don't have a standard interface.

If you pick a service mesh, the service mesh controls all of this, so you have one implementation, in whatever language your service mesh is implemented in. Now you can see why you might want that. Cloud Native Buildpacks and Service Bindings have some pieces you can stick in, but again, you're starting to build a big layer to replace that service mesh. So maybe a service mesh is a good choice.

But it can be hard to roll out a service mesh with part of your fleet on it and part off, and to figure out how to ramp all of that up. It's getting easier, especially if everything is in Kubernetes, but it's still hard. Sidecar injection has a lot of problems; I know Istio ambient mode is the new fix for that. It looks a lot like a CNI, to be honest: we run one agent per node, and it potentially encrypts things, and it doesn't tell you whether it did or not. The feedback may be slightly better, but it's still pretty opaque. The resource requirements can be high; I've heard people complaining about how much CPU they're spending on Envoy proxies instead of on their application. And a mesh has a lot of functionality, which is great if you need that functionality. If you don't, there's a whole bunch of complicated things where you need to figure out: do I need this or not? Is it important? Some organizations are going to take one step at a time. They're going to figure out how to get onto Kubernetes, and then they're going to figure out, do I need a service mesh, and if so, what does that look like? Some organizations are happy to jump in with both feet to all the new technologies. Sometimes those migrations work out great, and sometimes you're three years down the road going, still nothing works on my new platform, can I go back to the old one now? So there's something to be said for adopting one thing at a time.

Another reason might be multi-tenancy. Let's say I want a shared Redis service: this namespace asks for 100 megs of Redis cache, this namespace asks for 500, and I have one pool of workers behind it that services these requests. So I'm going to have a router in front, and you want that router to present a Service in each one of those namespaces.
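Here's a minimal sketch of that presentation trick, assuming a shared router living in a hypothetical redis-system namespace: each tenant namespace gets a Service that's really just an alias for the router, so developers see "their" Redis.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// presentRedis creates a Service named "redis" in the tenant's namespace that
// is really just a DNS alias for the shared router. The tenant connects to
// redis.<namespace>.svc.cluster.local and never sees the shared pool.
func presentRedis(ctx context.Context, client kubernetes.Interface, tenantNS string) error {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "redis", Namespace: tenantNS},
		Spec: corev1.ServiceSpec{
			Type:         corev1.ServiceTypeExternalName,
			ExternalName: "redis-router.redis-system.svc.cluster.local",
		},
	}
	_, err := client.CoreV1().Services(tenantNS).Create(ctx, svc, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	for _, ns := range []string{"team-a", "team-b"} {
		if err := presentRedis(context.Background(), client, ns); err != nil {
			panic(err)
		}
	}
}
```

An ExternalName Service is the laziest way to do the aliasing; a selector-less ClusterIP Service with an explicit port per tenant gets you further, which will matter in a minute when we get back to network policy.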
And a service mesh really wants to know what your underlying pod topology is. You're saying: no, no, this is an independent Redis here, and an independent Redis here, and an independent Redis here; that's the presentation you want to make to your developers. And the service mesh says: no, they all have the same identity, it's all the same thing underneath. For both of these cases you can tunnel things through with TLS SNI. But if you're running one of these multi-tenant services, that breaks stuff like network policy, because as far as network policy is concerned it's all going to one port; it's all one L4 path. Token projection: if you're a multi-tenant client calling out to different services, token projection only wants to vouch for one service account, the one your pod runs as. There's no way to say, I'm authorized to use this other account, please give me a token for something else. SPIFFE, again, really wants to attest "you are this pod," not "you're working on behalf of this service." So you can roll your own TLS SNI for this.

Let's talk a little bit about how we can undo some of the damage we did by wanting to share resources across the cluster for efficiency's sake. Full disclosure here: I work on the Knative project, and Knative effectively does this for your pods, because sometimes we scale you down to zero and sometimes we scale you up real big, and we have a central shared piece. If you had to run something for each deployment (hey, you don't have to run your pod, but you do have to run this other thing, and sometimes your pod too), everyone would say that doesn't save resources. But if there's one central component and you have ten namespaces scaling up and down from zero, that central component can give you big savings.

So, network policy. I was just saying these all share the same TCP port. Well, unless you allocate a bunch of different ports, one for each service, on your router instance. Then you can say: this Redis named foo is on this port, and if I create a bar instance, I'll make a Service for bar, it'll be ClusterIP, and the target port will be maybe one higher, or something like that. Then when the traffic comes in from the CNI, I can say: oh, it came in on port 16844, that must be for this service; it came in on this other port, that must be for that other service. And so I can go back to using layer-4 policies again. I can use TLS SNI in combination with this, or not, but it's nice to have defense-in-depth options. When you route everything to a single endpoint inside the cluster, you can't use the destination IP address, because SNI has stomped on that, and you can't use the sender, because you don't know who that sender was. You might be able to figure out which node it is, but a pod can appear and disappear at any time, so you don't really want to say, I looked, and at this moment it was this pod; maybe you're looking at a stale cache or something like that. So the destination port can let you recover this.

So token projection doesn't work here. But if it's important for you to call out with different identities (I don't have a great example of this yet, but I've been thinking about it because I fear I will in the future), you could actually be your own OpenID provider and say, hey, I'm acting as this client or that client. There might be a Kubernetes API for "I can act as something else."
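The closest existing thing I can point at is the TokenRequest subresource: if your identity has RBAC permission to create serviceaccounts/token in the tenant's namespace, you can mint a short-lived, audience-scoped token for that tenant's service account. A hedged sketch, with invented namespace and account names; to be clear, this is minting with RBAC, not a first-class "act as" primitive.

```go
package main

import (
	"context"
	"fmt"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	expiry := int64(600) // short-lived: ten minutes
	req := &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{
			// The audience the destination expects, like the projected-token case.
			Audiences:         []string{"orders"},
			ExpirationSeconds: &expiry,
		},
	}

	// Mint a token for the tenant's service account. This only works if our
	// own identity has RBAC "create" on serviceaccounts/token in that namespace.
	tok, err := client.CoreV1().ServiceAccounts("team-a").
		CreateToken(context.Background(), "redis-tenant", req, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println(tok.Status.Token)
}
```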
I know that AWS has that kind of act-as model, but I haven't seen any KEPs suggesting that impersonating another service account is on the Kubernetes to-do list. So this was the best I could come up with for token projection.

And sometimes you might need some of these pieces and a mesh. If you rewind way back to the beginning, we talked about why you might want an API gateway: you might want to do rate limiting on a per-API-key basis, and that might be something you'd rather keep on a central API gateway even if you're using a service mesh. So you might have traffic come in, through the service mesh to the API gateway, then through the service mesh to your application back end. And that might be the right choice for you. And if any of you were thinking, wait, isn't that a lot of Envoy CPU? Yeah, we talked about that earlier.

I'm going to call back as well to Zach's five points in the keynote yesterday. Pretty much every service that has user data and some sort of authorization model is, in a way, multi-tenant. Think about something like Facebook, or your shopping cart on Amazon: that is a service, and my shopping cart isn't your shopping cart, isn't your shopping cart. That's really a multi-tenant service. It's not an API controlling compute things, but you still need a lot of that same authorization logic. So you will actually need to get your application involved in the security even if you have a service mesh. You may be able to make some coarse-grained rules, X can't talk to Y, at a level a bit more intermediate than network policy. So rather than saying, in this diagram, orders can or can't talk to addresses, you can say orders can only talk to addresses as its own identity; it can't go and impersonate a customer or something like that. Maybe orders has a special lookup in addresses to go and get things out, and that's how you've decided to structure things. But addresses wants to know who's calling it, what permissions they should have, and whether they're allowed to access this or that type of data. It might even go down to the field level: I let orders view some of the address data but not historical data, whereas the user can see, oh, these are the past addresses I had. And yeah, you could do a lot of that logic in the mesh (whoops, I don't know why that happened), but imagine writing Istio policies like that for every single user. Your mesh would die, you would hate your life, and it would be much easier to build this into your application. So maybe your mesh can answer some of those lower-level questions about who's talking to me, but there are higher-level questions about what the application is supposed to do that you actually don't want to pull out into your service mesh.

I talked about API gateways earlier; this is more of the same. There's a lot of stuff you can get out of an API gateway. Some of it is stuff you'd otherwise need to build into your application today, and in some cases the gateway is the right place even today, because if you're using something like Apigee, you can call out to an external system to ask, hey, does this customer have this property? And building that into your application might be less disgusting than building it into your API gateway.
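Since I keep saying "get your application involved," here's a hedged sketch of what both layers can look like inside the application itself, in plain Go: a coarse workload-level rule (who is calling me?) and a per-user rule (whose data is this?). The SPIFFE ID and the X-Verified-User header are stand-ins; pretend some upstream auth middleware has already verified the user.

```go
package main

import (
	"crypto/tls"
	"net/http"
)

// callerWorkload pulls a workload identity out of the mTLS client cert,
// e.g. a SPIFFE URI SAN. Placeholder logic for the sketch.
func callerWorkload(cs *tls.ConnectionState) string {
	if cs == nil || len(cs.PeerCertificates) == 0 {
		return ""
	}
	if uris := cs.PeerCertificates[0].URIs; len(uris) > 0 {
		return uris[0].String()
	}
	return ""
}

func addressesHandler(w http.ResponseWriter, r *http.Request) {
	// Layer 1: which workload is calling? Coarse-grained, mesh-style rule:
	// orders may read addresses, but never update them.
	workload := callerWorkload(r.TLS)
	fromOrders := workload == "spiffe://cluster.local/ns/shop/sa/orders"
	if fromOrders && r.Method != http.MethodGet {
		http.Error(w, "orders may only read addresses", http.StatusForbidden)
		return
	}

	// Layer 2: which user is this about? Fine-grained, application-level rule
	// that no mesh policy is going to express for every single user.
	// Assume some auth middleware already verified this value.
	user := r.Header.Get("X-Verified-User")
	if user == "" || user != r.URL.Query().Get("customer") {
		http.Error(w, "not your address book", http.StatusForbidden)
		return
	}

	// ... fetch and return the address data, filtering fields per caller ...
	w.Write([]byte("{}"))
}

func main() {
	http.HandleFunc("/addresses", addressesHandler)
	// A real deployment would use ListenAndServeTLS with ClientAuth set so
	// that r.TLS actually carries a peer certificate.
	panic(http.ListenAndServe(":8443", nil))
}
```

The first check is the kind of thing a mesh could answer for you; the second is the kind of thing it never will.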
But you probably don't want every single node in your mesh loading these policies, because some of these policies are complicated and crazy, and yes, you could push that Wasm everywhere, but why do you want to do that?

And so the last piece here is what I've been alluding to: there are two different levels. At the network policy or mesh level: hey, is this connection allowed to come in, and what server on the network sent it to me? And there's a second level: what user made the request? You actually need to build both of these layers, and then you can start to make more complicated decisions, like: is a user coming in from the orders system able to call an API to update their address? Well, no, it doesn't make sense for that to come from the orders system; that should only come from the front end. And maybe in the future it could come from the front end or through customer support, and so the customer support portal becomes another top-level network policy source you can consider. Then you say: if it comes through customer support, I need to write this audit log over here, and if it comes from the front end, I don't need the same audit log, because presumably the user was logged in, as opposed to someone calling in and convincing customer support they were the user. That's a social engineering attack, and we are never going to get computers to protect us against that, because it's about humans talking to humans. A service mesh will never solve that, because it's technology. I love that we're building more of this technology and that we can close down the space of attacks that computers can do, but we need to think about the attacks that people do, too.

So yeah, in summary: everything is pretty awesome. We've got a whole bunch of really cool technologies, and we trip over the basics. We've got these awesome service meshes, but we can't get a CA cert into a random container, because that would require talking to Java people and Go people and C++ people and Python people and getting all these libraries to agree on a place, even if it's not the place they used to use. Authentication and identity are hard. Should you use SPIFFE? Should you use the Kubernetes token projection API? Should you use your cloud provider's managed identity service? I can't answer those things for you; there are trade-offs, and none of them is great or perfect. But we've been working on it for 40 years, so I'm sure the answer's just around the corner. Authorization is hard. Figuring out the right place to make an authorization decision is hard, and we can build tools to help you authorize things, but you need to figure out which are the right tools and the right places for them, and sometimes the answer is going to be: you write your own code, you write your own policy engine. Maybe ChatGPT will build me a policy engine that can make the right business decisions someday, but so far I think our job security is actually pretty good. And yeah, meshes will solve some of these things, as long as you're willing to put a lot of trust in the mesh. That makes me, as someone who believes in defense in depth, a little bit nervous, when all your trust is in X.
There was a talk yesterday about how you get encrypted communication within the mesh, but then the hop from the mesh to your application is cleartext. That's over localhost, so you can only snoop it if you get root on localhost, at which point I can also just rip the data out of your application's memory; /dev/mem is a thing. But it also means that my application doesn't know whether the other end is encrypted or not. I go make a connection and think, oh, this is great, and then I end up on a cluster where there isn't a service mesh, and there's no feedback that says: hey, wait, you tried to make this connection, you wanted it to be encrypted, and it's not. I have no way of hinting that to my environment. So service meshes make me a little nervous in that sense, because it's hard to see that you're getting the thing you want. And again, working around all of this may add new challenges in terms of getting all your identities out there and all your certs out there and so forth. So I bet we'll be here again. Well, maybe not in Seattle, but I bet we'll be having more cloud native security cons dealing with all of these problems for another five or ten years at least, and then maybe I'll do something else other than security, and new people will be here, and maybe they'll figure out the problems, or maybe we will. I actually have confidence that we can make these things better, but we're going to have to focus not just on the fancy new shiny stuff but on nailing the basics. So thank you, and sorry for the mess around starting on time. I'm happy to take questions and stuff, and corrections; it may be that there are good answers to some of this stuff that I've just missed.

I was talking to someone over at Keyless, I think, earlier, and they said, we have a great library for that stuff. And I said: great, we have another library for SSL. It's good to have some diversity there. Back when Heartbleed happened, I remember it affected basically everyone, because there was one SSL library, everyone used OpenSSL, and so when there was a bug, it hit everybody. The fact that there's BoringSSL now, and a couple of other implementations, is good, I think. But if the interfaces aren't standardized, on either the file system side or the application side, it's really hard to move between them. I know I don't have the right people in the room for that, but part of what makes it hard is needing to find the right people.

Yes? Okay, I'm going to repeat the question so it picks up on the recording. The question was: do I think that improving service mesh, or finding a new replacement concept, is a better way to solve this? I try not to define my solutions ahead of time; instead I define my requirements and see whether a solution matches. So I would call out some of these challenges and ask whether they're fundamental to the definition of a service mesh. Things like: I can't tell if the service mesh is there or not, which is also a problem we have with CNI. I can't tell if the CNI is encrypting; I can't tell if the CNI is enforcing network policies; my Helm chart can't tell any of that when I install. So maybe I install a bunch of network policies and it means bupkis. Or maybe there's already an encrypting CNI there, and so I'm doing TLS over an encrypted CNI, and yes, I've got two layers of encryption, and they don't cancel each other out or anything, but I'm spending more CPU than I need to. And we have no way today to really detect and understand
that in a standardized way. And if Kubernetes is really going to be the orchestrator for all of these things, I think Kubernetes should define some standards for figuring them out. That probably means that in four years we will have written some KEPs, and we will have implemented those KEPs, and they will have passed the beta stage and be on by default, and they will have rolled out to your cloud providers, and you will have upgraded to those versions. It's a long road, but if we don't go down it because we see it's a long road, it's not going to get shorter. It's not like people are going to wake up tomorrow and say, hey, I think we should make Kubernetes release more often, with features that are less thought through and less stable; that pressure is never going to win at this point. So the best we can do is start today and look at how we can do better. And as I said, it's not clear to me: Istio ambient mesh and a CNI are starting to look closer and closer together. I don't know what that means; maybe in five years we can look back and find out what it meant. I'm also a big believer that everyone here is trying their hardest, and that we're all doing the thing that looks best from our own position. So this is really not an "X is bad" or "Y is bad" talk, other than to point out: why have we not figured out how to do custom CAs in 20 years? But tomorrow could be better.

Good. If I can rephrase it a little bit, the question was: have I considered joining one of the steering committees and attempting to work on this? I kind of figured this out, or started figuring this out, three months ago, and I'm still trying to figure out what I want to do with the knowledge that I have. One of the things I did was put together a talk and submit it to the conference. But yes, I am starting to think about this, and I'm not even sure what the right steering committee is. On one hand, it feels like if I want to fix the CA cert stuff, I need to go find the folks in the Java community and the folks in the OpenSSL community and so forth and push them; on the other hand, if I want to fix stuff like the CNI issues, that's probably a different set of people. I want this stuff to be better, and I'm willing to put some effort in.

Well, now I have made up my missed time at the beginning. Y'all should have yelled at me, like, what are you doing over there? I thought I started at 2:10; it occurred to me that everyone was staring at me in a funny way. Okay.