Hello, everyone. Welcome to Cloud Native Live, where we dive into the code behind cloud native. I'm Annie Talvasto, a CNCF ambassador and Senior Product Marketing Manager at Camunda, and I will be your host tonight. Every week we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer all of your questions. Join us every Wednesday to watch live. This week we have a really great session on locking down your cluster, and I'm really excited for that. As always, one housekeeping item: this is an official live stream of the CNCF, and as such it is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct; basically, please be respectful of all of your fellow participants as well as the presenters. With that matter handled, I'll hand it over to the speaker to kick off today's presentation.

Thank you very much, Annie, for the intro. My name is Alejandro Pedraza. I work with Buoyant, the makers of Linkerd, as one of the maintainers of the open-source project, and today I'm happy to talk to you about Linkerd's implementation of mutual TLS and how we leverage the authentication that mutual TLS gives us to implement traffic policy. These are some of my social media accounts; if you want to hit me with questions about the presentation, I would be very glad to answer them. But if you use Linkerd or you're part of the project, I really recommend that you join our Slack channel. If you have questions, they usually get answered there, either by the maintainers or by the community itself. Go to linkerd.io, under the community section, and you will find the link to it. First, I'm going to show you a couple of slides just to make sure we're on the same page about a couple of concepts, and then we'll spend most of the time applying all of these things, and I'll be happy to answer questions. I'd like to first talk about TLS. It gives us three guarantees worth remembering: confidentiality, integrity, and authenticity. Confidentiality means that connections are encrypted, so if anyone is listening to the data, even if they can see the packets, they won't be able to make sense of them. Integrity means we have the guarantee that what we receive is exactly what was sent. And finally authenticity, which is what we are most interested in today, means that the server, the person on the other end, can prove they are who they claim they are.

There seem to be a bit of audio issues. If you can do anything from your side, that would be nice. I'd also advise the attendees: sometimes audio issues get worse if you have multiple tabs open, or if your browser is using a lot of your bandwidth, so I recommend trying those kinds of fixes as well, or refreshing your page; that's always a good thing to do. But I can at least hear what you're saying, so we can continue, and if there's anything you can do quickly, feel free to do that.

How does this sound? I think it's clear. It sounds a bit further away, but I think it might be clearer. Yeah. Okay, I'll try something here. Oh, that's better. I think this is a bit different. We can just go with whatever you feel works best; some people are saying that it works okay, and at least my voice is coming through quite okay.
But we can try a couple of things. Okay, there's still a little bit of it, but at least I was able to hear quite okay, so we'll see. All right, I'll go on then.

So I was talking about TLS: confidentiality, integrity, authenticity. There's an asterisk next to authenticity because usually TLS only cares about the authentication of one side of the connection. Most people are familiar with TLS in the context of web browsing, where the server presents a certificate that proves who the company or the person behind that page is, and usually the server doesn't care about the identity of the client. If it does, it implements that higher up the stack, at the application layer, through things like login forms. In the context of microservices in clusters, we are interested in both identities, client side and server side. For that we use mTLS, which is nothing more than TLS with authentication on both sides of the connection.

And what do certificates certify? Just a subject. In the case of Linkerd's mTLS implementation, the identity is tied to the service account of each of the workloads that get mTLS. It's not exactly the service account; it contains the service account, and we will see in the demo exactly what that string looks like.

There are some concerns around mTLS. For example, why do we have to implement this at the perimeter of each pod, and why not instead at the level of the host, or at the cluster as a whole? Because we might not have total control of what runs in our clusters, for example regarding dependencies. That's one vector through which people can get into our cluster and start communicating with other pods that they're not supposed to. So ideally we should secure things at the most granular level possible, and that is, in a few words, what the zero-trust idea is all about. There's also a performance concern. It is true that mTLS entails an additional handshake, but that only happens at the establishment of the session. There's also some CPU consumption whenever we have to decrypt incoming data and encrypt outgoing data, but in practice that's negligible, and it's worth the price given the additional security we get out of it. But the most important concern is that it's really, really hard to implement this well, in particular with regards to certificate management: there are multiple layers of certificates you have to handle, there's the issue of distributing those certificates across the whole cluster, and you also have to rotate to new certificates as often as possible. Thankfully, Linkerd gives you all of this automatically. I think one of the principal reasons people start getting into service mesh in production is just this: to have mTLS implemented without too much pain.

In Linkerd's specific case, we implement this through a three-level hierarchy of certificates. First, we have a trust root CA, which is used to issue another certificate that we call the issuer certificate, used by the Linkerd identity component. And it is with this issuer certificate that we sign all the certificates that get distributed to each of the pods that require them. The certificates that the pods receive are short-lived: they only live for 24 hours, and they are automatically renewed by the proxies.
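If you want to create those two certificates yourself, a minimal sketch with the step CLI looks roughly like this; the file names are arbitrary, and the one-year lifetime here is just an example, not a Linkerd requirement:

```
# Trust root CA: the certificate goes to the cluster, the private key should stay outside it.
step certificate create root.linkerd.cluster.local ca.crt ca.key \
  --profile root-ca --no-password --insecure

# Issuer certificate, signed by the trust root, valid for one year.
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
  --profile intermediate-ca --not-after 8760h --no-password --insecure \
  --ca ca.crt --ca-key ca.key

# Hand them to Linkerd at install time (the Helm chart accepts equivalent values).
linkerd install \
  --identity-trust-anchors-file ca.crt \
  --identity-issuer-certificate-file issuer.crt \
  --identity-issuer-key-file issuer.key \
  | kubectl apply -f -
```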
So let me quickly summarize how the Linkerd proxy injection works. Whenever a new pod starts, the first thing that happens is that a webhook gets called, the Linkerd proxy-injector component, and it injects the proxy into the pod as a sidecar. It arranges things so that the proxy starts first, and the first thing the proxy does is send the Linkerd identity component the service account token that proves this particular workload is associated with that service account. The Linkerd identity component checks that against the Kubernetes API and then returns the short-lived certificate to the pod. Only when the proxy finishes doing this can the main container start, so it has the guarantee that, from the start, any incoming or outgoing connection is going to be encrypted.

If you want to try this by yourself, it requires zero configuration. If you try it through our CLI tool, by default it's going to create the trust root and the issuer certificates; there's nothing you have to do to start playing with this. In production, by contrast, we don't recommend using the CLI, even though there are people who are successful doing that, because it offers all the features. Instead, we recommend using the Helm charts. The template functions in Helm don't allow us to create certificates using an elliptic-curve algorithm, which is why we cannot provide these certificates by default there. But in any case, in production it is always advisable that you take care of these certificates yourself, particularly the trust root and the issuer certificates, and provide them to Linkerd. Or, what a lot of people do is delegate that to something else, like cert-manager, which automatically creates, renews, and rotates those certificates, and Linkerd provides a nice integration with it.

Okay, so how are things so far? Okay, better. As you can see, everyone says the audio is a lot better, so it's all good on that point. Right, all right. Oh, there's a question: how does certificate rotation affect existing communication? Okay, great question. And, oh, by the way, we can see the Restream studio right now on the live view. Okay, no worries, let me move that to the other screen. Okay: it doesn't affect the communications at all. I'm not aware of the exact implementation on the proxy side of things, but as far as I know it doesn't stop communication or add any latency; it's all rolled out seamlessly.

So mTLS deals with authentication, and that's just one side of the security coin. The other side is what we can do once we know that a client is who they claim they are. That is authorization: what is this identity allowed to do? We do this in Linkerd through authorization policy, which we started shipping in our latest stable version, 2.11, released a couple of months ago. This is similar in intent to what NetworkPolicy offers, which is a native Kubernetes resource. NetworkPolicy, in a few words, lets you declare which pods can communicate with which other pods given their network identities, so basically IPs; you are also able to select pods through label selectors. Linkerd, instead of that, uses workload identity, which is a higher abstraction that is much more expressive and lets you do things closer to the application level. We're going to see concrete examples in the demo. It is important to keep in mind that this policy is applied and verified on the inbound side of pods: a pod will consult these policies whenever an incoming connection comes in, not on the outbound side.
Policy on the outbound side will come later, in the next stable version, but I think we can agree that the policy on the server side is what matters most. This particular implementation gives you these tools to work with. First, there's a default policy, which is simply an annotation that you can put at the namespace level or at each workload level. Similarly to NetworkPolicy, you can say, for example, that the default policy is to deny all connections, so you would lock down all the connections inside your namespace, and then on a case-by-case basis you can override that policy. Then we declare a couple of custom resource definitions: a Server, which allows us to select which pods we're interested in, which workloads, and which ports and protocols; and a ServerAuthorization, which holds the rules themselves and declares which Servers they apply to. So, for example, you may only want to receive connections on some port that uses gRPC, and only coming from other workloads that are identified in some specific, particular way, or you may allow all incoming traffic, et cetera. We will see detailed examples in a moment. So now I'm going to move to a demo. First I'm going to talk a little bit about how we deal with mTLS, and then we will see an example of authorization policy in action.

Perfect, yeah, a question here. There are a few follow-up questions after that; do you want to take them now, or do you want to take them at the end of the whole session? Let's take one or two now, I think, that would be fine. Great. So there's another one: if we are talking about certs, what is a good way to secure those? And thank you so much, everyone, by the way, for the questions so far. How to secure the certificates? Well, for example, if you are installing Linkerd with a trust root that you create yourself, Linkerd only requires the public key. That is stored in a ConfigMap and doesn't require any particular security, because it's public. Linkerd doesn't require the private key, so you should store it securely outside your cluster. The issuer certificate, which derives from that root, is on the other hand stored as a Secret in your cluster, both the public key and the private key. We will see specific examples in a moment, but yeah, they're just stored as a Secret, and the best policy is to rotate them as often as you can to avoid any problems.

Perfect. Should we take, okay, three more questions have come in, do you want to take one now or do you want to leave them to the end of the presentation? Let's leave those others till the end and go ahead with the demo, so we can land with examples. But keep the questions coming; we will get to them at the end. So yeah, let's go to the demo.

Okay, in the demo we will see a specific example of how to isolate namespaces, which is a common use case: you only allow traffic from within the namespace, and anything coming from outside the namespace is denied. So let's jump into that. Here I have an empty cluster on my local laptop, and first and foremost we're going to install Linkerd. It's very simple: linkerd install gives us back the YAML for the Linkerd resources, and we pipe that into kubectl. We're also going to use Linkerd's viz extension, which provides the visibility features we will use in a moment. But keep in mind that it is not needed to implement mTLS; it just gives us some extra visibility.
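For reference, the install steps being run here are essentially the following; a sketch assuming the linkerd CLI is already installed on your machine:

```
# Control plane install; the CLI generates a default trust root and issuer certificate.
linkerd install | kubectl apply -f -
linkerd check

# The viz extension adds tap and the dashboard; it is optional and not needed for mTLS.
linkerd viz install | kubectl apply -f -
linkerd viz check
```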
Okay, and as I said, by default this is going to create a trust root certificate and an issuer certificate and install them in our Linkerd installation. So let's take a peek at those specific certificates. As I just said, for the trust root it's only the public key, so it is stored as a ConfigMap; let's take a look at it. There we have it, the bundled certificate. We can inspect it more closely, so we'll do some jq magic here: we want the data field and the ca-bundle entry inside it. And we want to get rid of the wrapping quotes. I made a mistake there, I think I missed a quote. Oh, what's going on here? Sorry, I need JSON output here. Okay, fine, that's the certificate. I need to get rid of the line feeds. There it is, and I pipe that into step, which is a tool similar to OpenSSL but easier to use. All right, so this is our certificate. The first thing we see is that it lasts for a full year starting now, this is the subject, a very simple certificate. And most importantly, it is a CA, so it is able to sign and create other derived certificates, which is what Linkerd does during its installation: it created the issuer certificate that derives from this one, so let's take a look at that one. This one, on the other hand, since it is both the public key and the private key, gets stored as a Secret. We need the private key so that it can itself create all the certificates for the pods. So that's stored under the linkerd-identity-issuer Secret. There it is: the public key over there, the private key here. Let's inspect that public key again. It is base64-encoded, so we decode it here; there's our certificate, and we inspect it with step. This certificate also lasts for a year, it has the same subject as our trust root, and it is also able to generate other certificates below it.

Okay, so now let's get back to our cluster and install a sample app, emojivoto, something that we always use for the demos. Let's take a quick peek at what it does. These are the services it exposes; let's port-forward the web front end. Okay, this is a fun app that lets you vote for emojis. You can pick as many as you want, you can view the leaderboard, and there's a bot that is automatically voting, so we have some sample traffic that we can look at.

Okay, I'm going to create a debug pod, let's say a privileged pod, that will let us inspect the traffic on the host as a whole, so we can see what this app is doing. This is the node, and I'm using an image that I know has tcpdump, so let's run it for a moment. Okay, I'm going to inspect whatever comes through all the network interfaces, and I want to see any GET requests; let's pause this for a moment. You can see all these requests in plain text, so nothing is encrypted yet. These are the requests that originate in our bot. It's this part here: it communicates with our web front end, and that web front end communicates with the emoji component, which returns all the emojis that are available, and with the voting component, which actually persists the votes. So let's leave this running at the bottom, and now I'm going to inject this app, emojivoto. Let's do that in a separate window. I just retrieve the YAML of all the deployments inside the namespace, kubectl get deploy -o yaml, and I pipe that into linkerd inject.
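That pipeline, roughly as run here (emojivoto being the demo namespace):

```
# Re-apply every deployment in the namespace with the injection annotation added.
kubectl get deploy -n emojivoto -o yaml \
  | linkerd inject - \
  | kubectl apply -f -
```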
The only thing that linkerd inject does is add an annotation, so that when the pod gets created, the webhook that gets called, the Linkerd proxy injector, will see that annotation and inject the proxy as a sidecar. Let's pipe that into the cluster. We can see now that, as the pods get rolled out, there are two containers per pod: one is the main container, the other one is the proxy that just got injected. And, okay, let's take a look at, okay, yes, here, we're still sniffing the traffic on the host, and we no longer see that much traffic, because there are no longer GET requests going out in plain text. We still see things like these health checks coming from the kubelet, because the kubelet doesn't run inside a pod, so there's really no way to inject the proxy there, but I think that's fine. So we just added mTLS to this app.

Okay, how else can we make sure that all these connections are secure? There's the viz tap command; I'm going to shut this down here. Tap inspects, in real time, a sample of everything that goes through some specific pod, for example here the web deployment. It shows us some of the headers, and it confirms here that the connection is encrypted. We can also take a look at the dashboard. We're interested in the emojivoto namespace, and let's see the web deployment. Here are the incoming connections and the outgoing connections, and here at the bottom we see, for everything that comes in and goes out, the identity of each of those connections and whether they're secured or not. Here, they're all secured.

Let's take a peek at one of those pods, again web, for example. What can we see here? The first container we see is the proxy, which gets injected before the main container. We see that the proxy has access to the trust root certificate through an environment variable. Since all of the certificates in the cluster are rooted at the trust root, the proxy needs it in order to validate the certificates of the connections that come into the pod. Also of note here is an emptyDir volume that gets injected by the proxy injector. It is a memory-backed volume, and that's where the proxy stores its certificate, the 24-hour certificate. It doesn't store it on the file system, so in case there's a security breach, or a rogue process running on the host with access to the file system, it cannot see the actual certificate, along with its private key, of course. So there's no way to directly inspect that certificate like we did with the trust root and the issuer certificate, but there's a command for it, linkerd identity. We are interested in the emojivoto namespace and the web deployment. Okay, that's not it; I actually have to give it the entire name of the pod. There we go. So we see here the subject of the certificate. This is the workload identity that we mentioned many times before. The service account of this pod is called web, I forgot to show you that when we looked at the pod YAML, then comes the namespace, and everything that follows is the same everywhere. We have cluster.local at the end, which is the cluster domain; by default it's cluster.local, but it can be anything. So one thing to remember is that all the workloads in this namespace will share this suffix, and only the prefix, the service account, usually changes between workloads. This certificate is not a CA, it's just a leaf certificate, and it can be used for TLS authentication.
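The two inspection commands used just now look roughly like this; the pod name is a placeholder, so substitute the real one from kubectl get pods:

```
# Sample live requests through the web deployment; the output flags each request
# with tls=true when the connection is mTLS'd.
linkerd viz tap deploy/web -n emojivoto

# Print the short-lived certificate that the proxy in a given pod is currently using.
linkerd identity -n emojivoto web-xxxxxxxxxx-yyyyy
```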
All right, so let's dive into policy. Okay, I'm going to start tailing the log of the vote-bot pod just to make sure that it's working properly. All these lines mean that it is able to successfully send the requests for its votes. And now I'm going to annotate the emojivoto namespace with a default policy of deny, so we deny all the connections inside the namespace. I'm going to copy that command. There we go. So the default policy is going to be deny for the entire namespace. This only applies to new pods, it doesn't apply to the current ones, so we will have to roll them. For example, let's roll out the web pod; I'm just going to delete it. And we can see that the vote-bot pod starts failing, because its connection is getting denied now. Okay. Another important thing here is that the web pod is not able to start fully, only one container is running. So let's take a look at what's going on, here in a separate window, let's describe the web pod. You can see here that the problem is that the readiness and liveness probes are failing. That's because we haven't granted access to the connections coming from the kubelet, so let's do that first. For that, I'm going to use this policy here. Where am I? Okay. So this is our first Server CR, and it says that it applies to all the pods that have a port called linkerd-admin that uses the HTTP protocol. That happens to be the proxy endpoint that takes care of serving the probes. And then we create here the ServerAuthorization, which is tied to that Server, which is called admin, and the rule is just to let everything through: that's unauthenticated true, meaning that everything that comes in is allowed, and I think that's fine for probes. So let's apply that, and we can check that the authorization was actually persisted: there's this linkerd authz command, and it returns a Server resource called admin that is tied to an admin-everyone authorization. Okay. And now we see that the web pod is able to run successfully. The bot is still not able to communicate with it, because we haven't allowed the app traffic yet, so let's do that. Let's roll out the deploy and give it a moment. Oh, actually I forgot, yes, this is not going to work yet, because we only allowed traffic for the probes; we need to actually allow traffic for the app connections. So there's this other policy here, the app policy. We have this Server resource which deals with gRPC connections: anything that has a port named grpc and that uses the gRPC protocol is covered by it. In our case, that's the emoji and the voting pods. We have another one here that deals with HTTP connections, which is only relevant to the web component. And finally a ServerAuthorization that is tied to all the Servers above, because all the Servers above carry this same emojivoto label here. And the actual rule is to allow all the traffic whose identity matches this, which is all the workloads, all the service accounts, that live inside this namespace. As I told you before, they all share the same suffix. All right, let's persist that, and now let's roll out again. I'm going to tail the vote-bot logs again, and this time it is working fine, so all the traffic within the namespace is working.
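The resources applied in this part look roughly like the sketch below, assuming the policy.linkerd.io/v1beta1 API from the 2.11 release; the app.kubernetes.io/part-of label, the web-svc pod label, and the resource names are illustrative stand-ins rather than the exact ones from the demo manifests:

```
# Lock down the namespace by default (only affects pods created afterwards).
kubectl annotate ns emojivoto config.linkerd.io/default-inbound-policy=deny

kubectl apply -f - <<EOF
# Probes: let anything reach the proxy's admin port so the kubelet checks keep passing.
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: admin
  namespace: emojivoto
spec:
  podSelector:
    matchLabels: {}
  port: linkerd-admin
  proxyProtocol: HTTP/1
---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: admin-everyone
  namespace: emojivoto
spec:
  server:
    name: admin
  client:
    unauthenticated: true
---
# One of the app Servers (web's HTTP port); the gRPC ones follow the same pattern.
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: web-http
  namespace: emojivoto
  labels:
    app.kubernetes.io/part-of: emojivoto
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: HTTP/1
---
# Allow app traffic only from identities (service accounts) that live in this namespace.
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: internal-only
  namespace: emojivoto
spec:
  server:
    selector:
      matchLabels:
        app.kubernetes.io/part-of: emojivoto
  client:
    meshTLS:
      identities:
        - "*.emojivoto.serviceaccount.identity.linkerd.cluster.local"
EOF
```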
Now let's make sure that any traffic coming from outside the namespace is denied. For that, I'm going to install a parallel emojivoto instance in a separate namespace, using some sed magic. What this command does is take the same YAML we used before, but replace the namespace with emojivoto-2; that's all it does. So we can see it starting here. Let's take a look at, well, you can see at the top here that the vote bot still works, but this is the old bot, running in the emojivoto namespace. We should instead take a look at the one running in emojivoto-2 now. It's not working. Actually, that bot in emojivoto-2 is not trying to communicate with the web pod in emojivoto-2 but with the one in emojivoto. You can verify that by checking its YAML here, for the vote-bot: there's a special environment variable here, WEB_HOST, that indicates where it should direct its connections. We didn't replace this when we installed the copy, so it's still pointing to the old namespace, and that's why it is getting its connections denied. And we can check those denials directly on the server side, on the web pod, by tailing the logs of the pod. You can see the denials here: we see that the failing request is coming from this identity, and it turns out the bot's service account is just the default one, and you can see the IP this is coming from, this is .42. So let me verify that for a second. It said .42, and in my other window, .42, this one, so yeah, that's the vote bot in emojivoto-2. Okay. So this is all the visibility you get by default to catch these denied requests.
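A rough sketch of that step, assuming the emojivoto manifest from the Linkerd getting-started guide and emojivoto-2 as the second namespace name; the sed expression is approximate, since the exact substitutions depend on the manifest:

```
# Stand up a second copy of the app in its own namespace.
curl -sL https://run.linkerd.io/emojivoto.yml \
  | sed 's/namespace: emojivoto/namespace: emojivoto-2/g' \
  | linkerd inject - \
  | kubectl apply -f -

# Watch the denials from the server side, in the web pod's proxy container.
kubectl logs -n emojivoto deploy/web -c linkerd-proxy -f
```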
Buoyant does offer a SaaS product called Buoyant Cloud that gives you a more detailed report on all of this; you can set up alarms and notifications, to email, Slack, whatever, whenever one of these policies gets violated. Okay. All right. Yes. So let's get back to the presentation, just a moment, let me find my window here. Okay. So the second part in particular, what I just demonstrated, you can follow again in the blog post that I published on the Buoyant blog; I'm going to paste the URL in a moment. I just mentioned Buoyant Cloud: besides those extra alarms and notifications, you can also manage certificates more easily and get a host of other nice features, and as a bonus you get full Linkerd support. So if you haven't yet, and you are interested in using Linkerd in production, I recommend that you ask for a demo at buoyant.io/demo. Okay, I think that's all I have for today; we can go on with questions.

Perfect. A really great session and demo, always amazing to hear about Linkerd. So then on to the questions from the audience; let's kick it off with the ones that were asked in the beginning and see if we have any more coming in. Excuse me, my voice is a bit wonky today. So there was a question about whether this setup takes care of root CA verification, or is it limited to the network calls? Root CA verification, sorry, where is that question? I don't understand that fully. It was from LinkedIn, during the first question or the two first questions. So, is it limited to the network communication? Or, if the user is still listening in, you can clarify the question and we can answer it later. So, what you are guaranteed here is that when you receive a connection from another pod, its workload identity is what it claims to be, and we use the root certificate to validate that, because all the certificates on the pods are rooted at the trust root. Not sure if that answers the question. Great, if you want more clarification, LinkedIn user, just send another question in and we'll get to it soon. Great so far, no worries.

So another one, from Sundar, about authorization policy: how do you relate this to authorization and identity management with tools like Keycloak? Okay, yeah, I'm not familiar with Keycloak. In Linkerd we limit ourselves, again, to workload identities. I don't know whether Keycloak can be configured to use service accounts as identities; if it can, then I guess it would work fine. Great, and I think there was a clarification: so the assumption is that the root cert is automatically trusted. Yeah, that's where the trust is rooted, as the name implies. So yeah, if your root certificate is compromised, there's nothing you can do, and that's why the first thing you have to avoid is storing its private key anywhere inside your cluster. Great, and then there was the last audience question so far: which way do you prefer, one certificate for all workloads or one for each segment, and what kind of trade-offs do you think you will have, given that we have auto-renewal capability? We issue one issuer certificate per cluster. In the specific scenario of multi-cluster, each cluster would have separate issuers, but they have to be rooted at the same trust root so that both clusters can communicate. So that's really all the segmentation, yeah, that's it. Great, and then I think there was a continuation of the authorization policy question: so, ignoring Keycloak, how do you do role-based access control in Linkerd? Linkerd doesn't offer anything specific to RBAC; it just relies on whatever you set up with Kubernetes. One of the goals of Linkerd is to try to leverage the Kubernetes primitives as much as possible without inventing too many things on top of them, so RBAC just works as it should. mTLS and authorization policy are really separate from RBAC, because they are at the application level; RBAC lets you define what you can do against the Kubernetes API, not so much the connections within the app itself.

Great, so that was it for the audience questions. There are a few questions that I'm going to ask, but if there are any more from the audience, we do have a bit of time for those as well, so keep them coming, and thank you so much for them so far. So, a question: what kind of certificate management is required to do this? Okay, it's very simple. You have to provide the trust root and an issuer certificate that is derived from it, and you just have to make sure that you rotate them whenever it's needed. There's a linkerd check command that will warn you whenever that expiration date is close. Usually what people do in production is provide these through external tools like cert-manager, which I also mentioned, that will generate and rotate the certificates automatically. It's very easy to integrate that with Linkerd, and cert-manager also has some optional hooks that you can use with Linkerd. So yeah, I think cert-manager is, for now, the best practice regarding certificate management. Great, perfect. So how does mTLS compare to lower-layer encryption such as WireGuard? Yeah, the issue with lower-layer encryption is that since you don't have access to things higher up the stack, the only identity you have available is the IP, and that doesn't give you much flexibility in what you can do and in the rules you can express.
I think it would be hard, in the example that we showed, to secure just a single namespace, because you don't have a way to express which workloads you're interested in besides their IPs, which are something that can change. So I would recommend that you use tools as high up the stack as possible, so you can express things better. Great, and then there was another audience question: I noticed that on one of the logs the protocol says HTTP and not HTTPS, so is it possible to upgrade the connection to HTTPS for production? Yeah, that's interesting. Emojivoto indeed uses HTTP by default, but you have mTLS on top of that, so the connections are still encrypted. There's HTTP on the lower layer, but that's wrapped inside mTLS, so it doesn't matter that the lower layer is not encrypted itself. Besides that, we don't touch what goes underneath the mTLS layer, so we don't do any kind of upgrade in that sense. You could have your connections be TLS from the get-go and Linkerd will just let them through; it will wrap them as well in TLS, so it can provide the workload identity, and it works just fine, as if they were not encrypted. Great. We have a few minutes left, so final call for questions from the audience, and a final question from me: how do I ensure that I always know which communication is mTLS and which isn't? Okay, we showed that a little bit with the tap command in the console: there's this tool that shows you a sample of all the connections going through some specific resource, and it will tell you whether they're mTLS or not. You also have the dashboard, the Linkerd dashboard, which will show you, for a given resource, green checks on the incoming and the outgoing connections depending on whether they're encrypted or not. And if, like we saw, you set some policies to allow incoming connections only from within the namespace, we only allowed connections that have specific identities, and that by itself implies that they have to be mTLS; if they're not, the connections get denied. Great. Thank you so much. No new questions from the audience, but we answered quite a few already. No worries, we got through them all. But yeah, time to wrap up then. You can always continue the discussion in the CNCF Slack channel for Cloud Native Live, so tune in there if you want to chat with everyone about these topics. So, just to wrap things up, thanks everyone for joining the latest episode of Cloud Native Live. It was great to have a session about locking down your cluster, and I really loved the audience interaction and the questions from the audience. Thank you so much. And as always, we're bringing you the latest cloud native code every Wednesday. Next week we will be having a break, but we will be back in two weeks with another great session, so stay tuned for that. Thanks for joining us today, and see you in two weeks. Thank you very much, Annie.