So I'm John Belamaric from Google. This is Yong Tang from Ivanti. And we'll talk to you a little bit about how you may or may not realize you are leaking some of your service information in your Kubernetes clusters, and what you might be able to do about it. So Yong, take it over from here. OK, to get started, before we talk about Kubernetes service information and DNS, let's just briefly review role-based access control in Kubernetes. So what is role-based access control? Role-based access control defines which user can do what in a Kubernetes cluster. The principle of RBAC is least privilege. That means you only want to expose information to users that absolutely need to know it. The good thing about role-based access control in Kubernetes is that it makes a huge difference in a shared environment. Let's assume you have a shared cluster with several teams. Each team has their own agenda, their own features, their own services to maintain. And let's say someday one team makes a deployment and breaks certain things, leaving the whole cluster in a non-working state; then everyone suffers. That's the so-called shared environment problem. With role-based access control, you actually get a separation, so if one team messes up, it will not cause trouble for another team. Role-based access control has been available in Kubernetes since 1.8, so it has been adopted by many companies across the board. However, in today's session, I'm going to discuss a special kind of information, that is, DNS-related information. The uniqueness of DNS is that DNS is actually an outlier in the Kubernetes environment. How is that? First of all, DNS information in Kubernetes is always going to be public. DNS by itself serves as the entry point for all the services, because it serves the purpose of service discovery. The services in Kubernetes are exposed to their clients through DNS.
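As a concrete illustration of the least-privilege idea being described, a namespace-scoped Role and RoleBinding might look like the following sketch; the team name, namespace, and resource list here are hypothetical, not from the talk:

```yaml
# Hypothetical: members of group "team-a" can manage workloads
# only inside the "team-a" namespace, and nothing cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-a
  name: team-a-dev
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "deployments"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: team-a
  name: team-a-dev-binding
subjects:
  - kind: Group
    name: team-a
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-dev
  apiGroup: rbac.authorization.k8s.io
```

Because the binding is a RoleBinding rather than a ClusterRoleBinding, users in another team's namespace get nothing from it, which is exactly the isolation property the talk is after.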
DNS relies on the UDP protocol, which makes things even worse, because with UDP you have no authentication or authorization. This is in great conflict with the least-privilege principle we discussed just moments ago. So you have to fix that. How do you fix it? There are several ways, and depending on your scenario, there may be a very easy way. Let's just hypothetically say you work for a small startup. You have some great product you're trying to push. And because it's a small startup, the growth of the company is very high, let's say 50% year over year, or even more. In this case, your company's higher-level management is going to push for new features every day. They care more about growth; they care less about cost or security. Because of that, the company will normally operate with smaller teams, each team working on their own, pushing features constantly. That's the so-called fast deployment. In a scenario like that, you're going to ask: can we find an easy way to avoid the noisy or evil neighbor situation we discussed a moment ago? Of course there are some easy ways. We talked about shared information: we don't want information to be shared between teams, and we also want to avoid the scenario where one team's trouble spills over to another. Then let's just say: give every team a standalone Kubernetes cluster. Problem solved. You're going to say this solution seems a little dumb. But the reality is, at least from my experience talking to different teams across different companies, many companies actually operate this way. That is because if you have a company with, let's say, several hundred engineers and 10 or 20 teams, each team with only 20 people, you don't have a dedicated platform team. Every team just works on their own.
They just say, OK, we don't want to share an environment with another team, because it causes a lot more trouble to communicate with so many teams, and a lot of trouble to do all kinds of things. Besides, I'm working on my own; I'm trying to ship as soon as possible, so I want a dedicated cluster so I can release whenever I want. That's the smaller company with hyper-growth. And the thing is, if you just give everyone a Kubernetes cluster... by the way, one thing I want to mention is that we all know Kubernetes can support a very big cluster; the biggest cluster Kubernetes can handle is about 5,000 nodes. But on average, when I talk to companies with fewer than 1,000 employees, most of the clusters range from, let's say, 5 nodes all the way up to maybe 20 nodes. So those are not very big clusters. That's the scenario we encounter with smaller companies. But what about if you work for a company that's getting a little bigger and the growth has slowed down? When the growth of a company slows down, you realize a shared environment becomes a necessity. There are several reasons. One, your company's management is pushing for profit over growth, because there's no real growth anymore. Two, features will be released less frequently, as expected. And finally, if you work on a team and you have been maintaining a service for a long time, you have probably already adopted the mindset of: if it's working, don't change it. No one wants to make a change. Now consider a shared cluster run by a dedicated team. That's great, because with a dedicated team and a shared cluster there are several advantages. One, it increases CPU, memory, and GPU utilization. Two, you can update the Kubernetes clusters and infrastructure more frequently, which is going to increase your security coverage.
And because you have one dedicated team doing all the upgrades, you have less cross-team coordination. One issue I observe with the so-called distributed mode, where each team has its own cluster, is that every team works on their own. They are excited at the beginning of the development cycle, but when the service gets into maintenance mode and you need to release a security update, you realize that even just sending a notice to, let's say, 20 or 30 teams, telling them, hey, there's a Kubernetes version that's getting a little old and we have to upgrade to a new version, is very troublesome, because every team has their own schedule and agenda. They don't want to be disrupted by someone saying, you have to stop whatever you're doing and upgrade, right? So if you have a dedicated team maintaining a shared environment, your company normally has better efficiency across the board. Now here's the question we are going to ask. If you are the cluster admin and you work for this dedicated team, maintaining, let's say, 200 Kubernetes clusters, or maybe just one gigantic Kubernetes cluster with, let's say, 1,000 nodes, how are you going to fix the Kubernetes DNS information leakage issue? I'm going to hand it over to John and he will do the magic. Thank you, Yong. Okay, so I'm going to do a little demo, and we'll see how bad this problem actually is. What I have here is a kind cluster running, and we can see it's got some stuff running in it; there's a bunch of services running in it. And you know, we can launch a pod. What we see here is that that pod, because I didn't give it any particular service account or anything, is going to get the default service account in the default namespace. And we have RBAC, of course, on our cluster.
And so we can see that we have no access to anything. So the scenario here is, as Yong was saying, as your company grows you may want to create shared multi-tenant clusters. You have users with access to that cluster. They can create pods; they can do what they will. But of course they have their access scoped to whatever their particular RBAC is, and typically that's to specific namespaces. So here we've got a pod that has basically no privileges to the API. But even so, it does have privileges to query DNS, and you can actually use that to glean information. In the Kubernetes DNS specification, because it's sort of coming from the DNS world, services are represented in a DNS naming schema, and the server responds accordingly: if a namespace exists, a query for a name in that namespace returns NOERROR even when there is no matching record, because the DNS name for that namespace exists; if the namespace doesn't exist, you get NXDOMAIN. So I'm going to query for just an A record, that is, a record for an IP address, that matches a namespace name that doesn't exist. And you'll see that I get: doesn't exist. But if I query for one that does exist... oops, sorry, it definitely doesn't exist when I type it completely wrong, still doesn't exist... and then if I query for one that does, we know kube-system exists. All right, I get no records, but the status is NOERROR; I don't get the NXDOMAIN saying there's no such domain. So what this is telling me is that it's leaking a little bit of information: if I can guess a namespace name, then even though I have zero privileges on the actual API server, I can determine whether that namespace exists. Okay, that's kind of interesting, but guessing a bunch of namespace names isn't really that easy to do. But if I think about it, I know I can probably use this to look up the API server itself, and I can see, well, that's got IP address 10.96.0.1. You know what?
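The namespace probe being demonstrated can be sketched as a few lines of shell. This is my reconstruction, not the exact commands from the demo: `cluster.local` is the default cluster domain, the namespace list is made up, and `parse_rcode` is a helper I've added for illustration.

```shell
#!/bin/sh
# Pull the DNS status (NOERROR / NXDOMAIN / ...) out of `dig` header comments.
parse_rcode() {
  sed -n 's/.*status: \([A-Z]*\).*/\1/p' | head -n 1
}

# Probe whether a namespace exists: NXDOMAIN means it does not exist;
# NOERROR with an empty answer means the namespace's DNS name exists.
probe_ns() {
  dig +noall +comments "anything.$1.svc.cluster.local" A | parse_rcode
}

# Only run live probes when explicitly asked (needs a cluster resolver):
#   RUN_PROBE=1 sh probe.sh
if [ "${RUN_PROBE:-0}" = "1" ]; then
  for ns in kube-system default staging payments; do
    echo "$ns: $(probe_ns "$ns")"
  done
fi
```

The key point is that nothing here touches the Kubernetes API: the only capability used is plain DNS resolution, which every pod has.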
That means that the service CIDR is probably 10.96.0.0/16. There's another thing DNS can do: reverse IP lookups. So if I know the CIDR, I know the addresses, and I can do a bunch of reverse lookups. A /16 really isn't all that big these days. I can actually take this stupid script, ten lines of shell code, not even fancy bash, and I can walk through that whole /16 doing reverse lookups. So let's give it a try. Oh, okay, found two things in the first /24. It's going to go through, and it takes about six seconds with this stupid little script for each /24, so that's about 20 minutes for the whole /16. I don't want to sit around here and wait 20 minutes, so we're going to cheat just a little bit, because I happen to know something the attacker wouldn't know: I can look here and just start at, say, this particular /24, and we can try that again. Let's see what we find. Okay, so it's scanning and, oh, look at that. I have a service running here. Gitea is an open source Git provider, like GitLab or GitHub; it's sort of an open source clone of GitHub. So I can see now, even though I don't have any privileges to the API server, that this service exists. And I can actually go and query for an SRV record. An SRV record in DNS returns not just an IP address but also a port, and every service that you publish in the Kubernetes API server typically declares its ports, which end up in SRV records. So if I do that, let's see, .cluster.local, I don't actually have to type all of it out, I guess, but we can see, okay, it's got port 22 running, that's for the Git SSH service, and port 3000. I wonder what that is. Well, I know that Gitea tends to serve HTTP there. So maybe I can just try this: cluster.local, port 3000, and oh, look at that. So here I am.
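The "stupid little script" isn't shown on screen, so here is a hedged reconstruction of what a reverse sweep of one /24 might look like. The 10.96.0.0/16 CIDR comes from the demo; the function name and structure are my own.

```shell
#!/bin/sh
# Reverse-sweep one /24 of the guessed service CIDR using PTR lookups.
# $1 is the first three octets, e.g. "10.96.0".
sweep24() {
  for host in $(seq 1 254); do
    name=$(dig +short -x "$1.$host")
    if [ -n "$name" ]; then
      echo "$1.$host -> $name"
    fi
  done
}

# Sweeping the whole guessed /16 would just wrap this in another loop:
#   for net in $(seq 0 255); do sweep24 "10.96.$net"; done
```

Each hit maps a ClusterIP back to a service name like `kubernetes.default.svc.cluster.local.`, which is exactly the tenant inventory the attacker wanted.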
I can start to explore all of my other tenants' services, what they have running, and I can start poking at them, looking for ones that have vulnerabilities, and then of course I can crack into those. So we've leaked some information through DNS that can be useful to attackers. What can we do about that? How can we reduce the leakage? Well... sorry, I'm checking my notes here. So we have a feature in CoreDNS. It's not built in; it's called the firewall plugin. And I can switch back to the slides. It's an external plugin, but it's part of the CoreDNS organization, which means we maintain it as part of CoreDNS. But because it's an external plugin, the CoreDNS image you're downloading, the one running by default in your EKS cluster or your Google GKE on-prem cluster, or OpenShift... I don't know if OpenShift runs CoreDNS or not; I think they do, but I'm not sure... anyway, the one that's running there by default is not going to have this in it, so you'd have to build your own custom image. Basically what it does is it's got a built-in expression engine that allows you to write simple expressions that look a little like C. You can also alternatively integrate with external engines like OPA. But it allows you to make policy decisions on the requests that come in, based upon metadata that we can associate with the individual request. So in CoreDNS, we have another plugin called metadata that is built in. When you enable the metadata plugin, it essentially tells other plugins, hey, add information to the context. So stepping back, maybe I didn't explain: the way CoreDNS works, differently from most DNS servers, is that it's a request processing pipeline.
So a DNS query comes in and is accepted by CoreDNS, and based on your configuration, it's handed to a plugin, which will either look up the information in some data source to satisfy that query, or manipulate or change the query in some way, or decide it has nothing to do with that query and pass it on to the next plugin in the chain. So you have this whole chain of plugins. Metadata is one of those that doesn't actually do anything to the query. All it does is create an entry in the Go context. In the Go programming language, there's a context that we typically pass down through a chain like this, and that entry in the Go context tells the other plugins, hey, somebody's interested in metadata. The reason we do that, the reason you have to enable the metadata plugin, is that of course this has a performance cost, and DNS is a very, very hot loop. We want tens of thousands of QPS from a single core, so anything in that path that isn't needed, we try to avoid. In any case, if you enable the metadata plugin, then when your request comes in, on this diagram you see the request coming in from the top left there. It's from a particular client IP; it's for an SRV record for some query name. The metadata plugin doesn't change the actual request at all; it just adds this metadata placeholder to the context. The firewall plugin on the way in doesn't actually do anything, because of how we've configured it. But the Kubernetes plugin gets that request and says, ah, okay, that's for the cluster.local zone, so I own this query and I'm going to resolve it. It actually has a cache of the services; it finds the answer in that cache, but it sees that the metadata plugin is enabled, so it also adds a bunch of metadata to that request context. Now this is a chain, a function call chain, right?
So on the way back, every plugin sees the response the previous plugin produced. That request now carries the response. I guess I asked for an SRV and I'm showing an A record, so my slide isn't perfect here, but the response here is an A record, and it also has some metadata on it. The firewall can now take advantage of that. So we can write a firewall rule, within the CoreDNS configuration, that says: hey, if the client's namespace does not match the namespace of the service being requested, send an NXDOMAIN instead of the actual response. So let me show the configuration, and I'll show that demo in a moment, or just now, actually. Let's take another look here. I exited out of that pod, and I have here a couple of files. If we look, this is the deployment file I pulled out. You can see this is a special build, right? We're not using the standard CoreDNS default build, because it doesn't include the firewall plugin; I had to make a special build for it. And then I pulled out the ConfigMap that's used to configure CoreDNS in this kind cluster, and I edited it, so we can see, there we go. Okay, so what did I change? One, I enabled the metadata plugin. Two, I switched to pods verified mode; I'll explain that in a minute. And three, I added this stanza configuring the firewall policy. Basically it's saying: allow this query if the Kubernetes namespace equals the client's namespace, or if I'm querying for something in kube-system, or for something in the default namespace; otherwise, block it. All right, let's give it a try. We have to first apply that change. Okay, that updates the ConfigMap, and then, just to make it faster... CoreDNS will actually reload the Corefile on its own, but whether the change reaches each instance promptly, there can be race conditions, let's just say that.
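The edited Corefile being described might look roughly like this. The firewall expressions and metadata label names below are from my recollection of the coredns/policy plugin's documentation, so treat this as an illustration and check that plugin's README for the exact syntax:

```
.:53 {
    errors
    metadata
    firewall query {
        allow [kubernetes/client-namespace] == [kubernetes/namespace]
        allow [kubernetes/namespace] == 'kube-system'
        allow [kubernetes/namespace] == 'default'
        block true
    }
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
}
```

The three changes mentioned in the demo all appear here: `metadata` is enabled, the kubernetes plugin uses `pods verified` (which is what lets CoreDNS map the client IP back to a namespace), and the `firewall` stanza encodes the allow/allow/allow/block policy.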
And then let's make sure those pods restarted. Okay, 19 seconds ago, so those are restarted. Let's do our example again and try that SRV query again. If I spell it wrong, it's definitely going to... and boom, NXDOMAIN. So hey, it actually works. Amazing. All right, that's interesting. Now, I mentioned a moment ago... sorry, I'm just scrolling my notes here... I mentioned I had to enable pods verified mode. So let's talk a little bit about that; let's go back to the slide deck. So it works, it's great, right? Why don't we just enable this everywhere? Why do I have to build a special CoreDNS and then edit the standard Corefile that's in every single cluster out there in the world? There are a few reasons. The reason I need a special build is that the firewall plugin brings a whole bunch of dependencies with it that we didn't want in the main application, because it's a kind of special use case. That's why we keep it out and don't build it in by default. But more to the point, why isn't all of this the default? Well, one reason is that you need to use pods verified mode. Here's what pods verified mode in the Kubernetes plugin is about: in the earliest days of Kubernetes, there was a feature where you could request an IP address as a name, something like 10-10-10-10.default.pod.cluster.local, and it would return 10.10.10.10. But it would actually return whatever IP you put into the name; it just always answered, because they didn't want to watch pods. To understand watching pods, consider how the Kubernetes API server works: CoreDNS, like all other controllers, creates a persistent connection to the API server and says, I'm interested in this information, send me all of it, and then send me any changes to it.
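The pod-record scheme being described encodes the IP address in the query name itself, which is why the old `pods insecure` behavior could answer without watching anything. A tiny sketch of the name format (the pod IP and namespace here are made up):

```shell
#!/bin/sh
# Pod A records take the form <ip-with-dashes>.<namespace>.pod.<cluster-domain>.
# With `pods insecure`, CoreDNS just echoes the embedded IP back; with
# `pods verified`, it answers only if a pod actually holds that IP.
pod_query_name() {  # $1 = pod IP, $2 = namespace
  echo "$(echo "$1" | tr '.' '-').$2.pod.cluster.local"
}

# Hypothetical usage against a live cluster resolver:
#   dig "$(pod_query_name 10.244.1.17 default)" A
```

Since the answer is derivable from the question alone, `pods insecure` needs no pod watch at all, which is exactly the trade-off against `pods verified` discussed next.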
So in a 10,000-node cluster, with many, many tens of thousands of pods, a watch on pods is extremely expensive. You're pulling all of those pods into memory, and then every single change to any pod running anywhere in the system gets pushed down to your process. So, pods verified mode allows those pod queries. The default that we changed it from was pods insecure, which basically replicates the ancient behavior. The actual default, with no pods option at all, is to just not support pod names, and that's what most people do nowadays, I think. But in any case, pods verified was put in to say: hey, it's actually really insecure to return any name from DNS, because DNS is often used as a root-of-trust type of thing. For example, in your TLS negotiation, if you can effectively spoof that you own a DNS name, you can probably play some tricks. So we said, we don't want to do that, and we added this pods verified mode: it puts a watch on pods and only returns the pod IP address if there's actually a pod with that IP. But that has terrible implications in large clusters, and it's one of the reasons this isn't on by default. There's also a more subtle issue, and that's a race condition. Remember, CoreDNS has a connection to the API server, listening for changes on pods, and you launch a new pod. The notification of that new pod to CoreDNS is an asynchronous process. So if you've got a pod that launches and immediately starts making outbound connections and DNS queries, and you've implemented what I've shown you here today, then that pod will initially fail those DNS lookups. You'll actually get application failures until CoreDNS receives the watch event, processes it, and puts it in its cache. So if your workloads are finicky and don't handle DNS resolution failures very well, then this solution won't work for you either.
So those are two big reasons why this isn't on everywhere all the time. I have a note here: you can solve that last one by failing open, which means writing our firewall policies to allow unknown clients to access anything, but then, kind of, what's the point? So what in the world do we do about it? Well, I'm not going to leave you there. I will say there is one thing you can do slightly better, and that's per-tenant DNS services. I just need to check the time here. Yeah, last topic, I think. So the idea here is that you combine some of the firewall concepts we've talked about today, but you also segregate your DNS instances per tenant. This only works if you have fairly large tenants, say tenants that are going to be creating a lot of namespaces, and you can prefix those namespaces with the tenant name; there's work you have to do to make this happen. The idea would be to combine it with a mutating webhook, so that when somebody creates a pod in a namespace that belongs to one of these tenants, you mutate the DNS policy on the pod: you change the DNS resolver address those pods are going to use. And of course you have to run a separate CoreDNS instance whose scope is limited to the namespaces of that tenant. Within the Kubernetes plugin in CoreDNS, you can list a set of namespaces, and then we will only serve records for those namespaces. So that works for the tenant, as long as the tenant doesn't need to access any common services; if they do, you need to write policies to allow that. Typically you'll also still need to run your central DNS for all your platform services, which aren't tenant services, and you may have to craft policies there to only allow lookup of the Kubernetes API server and other specific platform services you want those tenants to see out of the central DNS, or even use network policy to limit visibility of the central DNS to the tenant pods. That's what I've got for you.
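Concretely, the two pieces of the per-tenant approach might look like this. The tenant name, namespace list, and resolver IP below are all hypothetical. First, the pod-spec fragment a mutating webhook might inject for pods in the tenant's namespaces:

```yaml
# Patch for pods in tenant "acme" namespaces: point DNS at the
# tenant-scoped CoreDNS Service instead of the cluster default.
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 10.96.200.10            # hypothetical ClusterIP of the acme-only CoreDNS
    searches:
      - acme-apps.svc.cluster.local
      - svc.cluster.local
      - cluster.local
    options:
      - name: ndots
        value: "5"
```

And second, the tenant's CoreDNS instance limits the Kubernetes plugin to that tenant's namespaces, so it simply has no records for anyone else:

```
kubernetes cluster.local {
    namespaces acme-apps acme-infra
}
```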
Any questions? There are two mics, here and here. Can you hear them?