Yeah. Thank you for the welcome, Matt. And thank you, everyone, for attending. We are going to be looking at an end user threat model for Envoy Gateway. My name is Torin van den Bulk, Cloud Native Security Engineer at Control Plane. And this is my esteemed colleague, James Callahan, also from Control Plane.
So a little bit about us before we get started. Control Plane is a cloud native security consultancy that specializes in cloud, Kubernetes, and containers. Our clients include government actors, as well as financial services and highly regulated industries. We have just over 50 people across the UK, Northern Europe, and Australasia, and we're expanding.
To get us started, we're going to go over what we're going to do, what this work is, and what our current state is. So what is our scope here? We were commissioned by the Linux Foundation to do this piece of work. But we'd like to say that it is not a security review or an audit of the project itself. It is a threat model based on end user implementation and the security considerations revolving around the actual use of Envoy Gateway. And a huge thank you to Arko and the team at Tetrate, with Matt and Zach as well, for their ongoing collaboration and feedback throughout this process.
So through our presentation, we'll give a quick introduction to what threat modeling is, why it's important, and why it's helpful. Then we're going to go through three specific demos, attempting to show the realization of some of the attacks that we've enumerated in our attack trees. And then some next steps forward from this presentation: we look to finalize a threat model report, and specifically to update and refresh it based on the recent 0.6 release a few days ago.
So what is threat modeling? Threat modeling is an interactive exercise that tries to bring everyone to the table, not just security folks, to identify and enumerate threats and vulnerabilities.
It looks at devising mitigations and security controls to reduce residual risk, as well as escalating the most important risks that you've found pertinent to your system.
So you might ask, why threat model? A lot of the value here is in identifying security flaws and risks early in your system design process, thereby saving you from time-consuming redesigns that would otherwise have had to happen if you had not considered the security risks to begin with. It also helps you to focus and fine-tune your security requirements against actual risks that are mapped to attack trees, and to identify complex risks and data flows. As I mentioned, everyone can and should threat model, not just security teams.
So, on to the process that we follow. It's a four-step iterative process that starts with: what are we building? Here you need to define your scope and understand your system, your data, your adversaries, and your architecture to be able to begin the threat modeling process. You will build data flow diagrams, as well as system architecture diagrams, to understand what it is you're working with and how you can build a threat modeling exercise off of that.
Moving into step two, we look to catastrophize your system. So what can go wrong? Typically, we'll use open source threat intelligence sources, as well as brainstorming and techniques like STRIDE, to enumerate threats, and then build attack trees to understand complex threat vectors based off of that and how the two map to each other.
In step three, you actually get some of the solutions, right? We will begin to build a risk management strategy based on your use case and your system architecture. We will devise controls to help mitigate your threats and then implement controls based on that risk management strategy that we set out at the beginning.
The fourth, and arguably one of the most important steps, is the iterative part of the process. So did we actually do a good job in our threat modeling exercise? Here we map controls to the attack trees that we have elucidated. In many cases, you'll build automated tests to test the control implementation, and sometimes offsec and pen test exercises as well, to further validate the attack trees you have produced and the controls that you use to break those attack chains. And it is highly encouraged to revisit your threat model regularly to keep it relevant.
So to start us out, before we dive into Envoy Gateway itself, we need to understand what Gateway API is. This will come into play later in the presentation as well, as we begin to categorize some of our risks. Gateway API, as some of you may know, is an open-source project maintained by SIG Network. It is intended to be expressive, and to be the next generation and iteration of the Kubernetes Ingress API.
So we have here a very high-level overview, an architecture diagram that you can find in the Gateway API docs. At its base level, Gateway API is composed of three resources. Gateway classes act as a decoupling between the actual mechanism, or in our case controller, and the Gateway implementation itself, and provide a class of Gateways, almost a templatized configuration for your load balancing resources. The Gateway is the actual load balancing resource itself, the ingress mechanism. And the routes are what actually bring in and route your traffic through to your services in different namespaces on your cluster.
It is intended to be a role-oriented API, and its security model is based on three primary personas. There is also a four-tier model that includes application admins, which we'll get into in the next slides.
But at a high level, it's looking at your infrastructure provider, which will manage the highest level of resources and have the highest level of permissions over Gateway classes, Gateways, and routes. Then your cluster operator, which may sometimes configure and have access to your Gateway class, but in most cases will be working with your Gateway resource and your routes. And at the lowest level, your app dev, who should primarily be concerned with the actual routes that provide service to the applications that they manage and develop.
So now we understand a little bit about what Gateway API is. So what are we building here? In this threat modeling exercise, we build a demo deployment of Envoy Gateway with a few different deployment topologies that James will get into a little later. At its base level, Envoy Gateway implements Gateway API, where Gateway classes are cluster-scoped resources that are intended to be managed by the infrastructure provider. Each Envoy Gateway controller will accept a single Gateway class resource. Gateways are intended to be namespace scoped, and they implement the Gateway class, which, as I mentioned, acts as a templatized configuration. And routes are bound to a parent Gateway and then send traffic downstream to the services in the different application namespaces for developers.
Envoy Gateway operates on static and dynamic configuration. The static configuration is used to configure Envoy Gateway at startup and should be stored in a secure source control management repository. The dynamic config is used to reconcile the desired state and to provide dynamic updates to the state of the data plane via Kubernetes resources, typically through applying manifests or some sort of GitOps deployment.
So, some of the threat actors that are considered in the security model for Gateway API. This is the four-tier security model that operates in a multi-tenant context.
First, the infrastructure providers, who have access to your Gateway class, your Gateway, and your routes. The cluster operators, as I mentioned a little bit before, sometimes have access to your Gateway class, but in most cases will be working with your Gateway and route resources. Your application admins will not be working with your Gateway class. In the specific namespaces that they are assigned to manage, they will work with the Gateway itself and then the routes. And at the lowest level, the application developers should only be working with, and concerned with, the actual routes that provide service to their applications. So I'm going to hand this over to James.
Thank you very much, Torin. So as Torin has said, we are going to look at a multi-tenant deployment model. In threat modeling, we want to elucidate the most threats, so we want to make it more likely that something can go wrong. We can obviously do this more easily in a multi-tenant setting, where we have a new threat actor, which is the co-tenant.
All of this infrastructure is spun up for you if you go to that repo that we mentioned at the start of the talk, the threat modeling Envoy Gateway talk repo in the Control Plane organization. What we're going to do is spin this all up on a kind cluster and look at a few different deployment topologies for Envoy Gateway. We're not suggesting at this point that one is better than the other. We're going to find that out through threat modeling and catastrophizing, as Torin said first.
So on the left-hand side here, we have a single tenant called Tenant A. Very imaginative, I know. Tenant A has their own dedicated gateway class. They have their own Envoy Gateway controller in a namespace which they own, the Tenant A namespace. Again, our naming is just so imaginative. And this Envoy Gateway controller provisions a dedicated Envoy proxy for Tenant A.
They also create their routes in this namespace, and their back ends are also in this namespace. So a very, very simple single namespace model. We see there is traffic coming out of that Tenant A namespace to the Tenant D namespace. We'll get to how that is actually done through resources later on. Tenant D doesn't have to be a tenant in the strictest sense of the word. It could just be another namespace which Tenant A has some control over. But again, we want to look at the worst case scenario, so we'll consider the case where we have a route to another tenant.
On the right-hand side, we have a shared gateway approach where we have a shared gateway class. Again, this is a cluster-scoped resource. We have a shared Envoy Gateway controller in a dedicated namespace, provisioning an Envoy proxy for shared tenants in that dedicated shared namespace. We then have Tenants B and C sitting underneath, and their traffic all uses this shared Envoy proxy.
Let's have a look then at how we get these cross-namespace routes. There are two ways. The first one is the shared model that we talked about before. In this shared model, we need a gateway listener. This is the shared gateway listener, as you can see from the namespace. This shared gateway has a listener with an allowed routes section, and in the allowed routes, we match on namespaces. So we say that in the Tenant B and Tenant C namespaces, we can create HTTP routes which bind to this gateway. Then we have an example on the right-hand side of such a route. We're looking at Tenant B here, and we can see we're just routing TenantB.example.com to a back end running in Tenant B's namespace. So that's one model.
The other model is using reference grants. This is related to the left-hand side of that diagram from before. We have a Tenant A gateway in this case. We have a Tenant A route. But that route references a Tenant D back end.
Tenant D then needs to create a reference grant in their namespace in order for this to work. Otherwise, the traffic will not get routed there. So what we're seeing is that, in both cases, Envoy Gateway has this built in. It's a handshake mechanism where, to get cross-namespace traffic, you need some resource to be created in each namespace. So we have protection against malicious cross-namespace routes built in to Gateway API and, by extension, any implementation.
However, that doesn't mean that we stop threat modeling here. We still want to understand what an attacker would have to do to create a malicious HTTP route, and why they would want to do this in the first place. So let's consider a scenario where we have a malicious Tenant C. Remember, they were sitting underneath the shared gateway. If Tenant C were malicious, they could perhaps think of a scenario where they want to direct legitimate users of a Tenant B service to something that looks like a Tenant B service, but is actually served up by Tenant C. If they can do this, they could maybe try to steal credentials, try to get someone to log into a cloned page, something like that.
So let's look at what this would look like. First of all, we spin up the infra in the diagram and we set up port forwarding to our Tenant A gateway so that we can hit Tenant A resources on localhost 8888. Then we'll just curl our Tenant A example. It's just an echo server, and you can see at the bottom it's being served up from the Tenant A namespace. Let's do this again now for our shared gateway. We'll set up port forwarding so we can hit localhost 8889, and we'll take the example of Tenant B. So we'll curl TenantB.example.com and again, we see that it's served up as expected from Tenant B's namespace.
However, what happens if Tenant C can create this malicious HTTP route? We can see it's created in the Tenant C namespace, but it's referring to TenantB.example.com slash totally legit.
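The two-sided handshake described above can be sketched as a small Python model. This is only an illustration under simplifying assumptions, not how Gateway API or Envoy Gateway is implemented: the classes loosely mirror the ideas of a listener's allowed routes and a reference grant, and every name here (`Listener`, `Route`, `ReferenceGrant`, `route_can_attach`, `backend_ref_allowed`, the namespace names, the hostnames and the path) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Listener:
    """Gateway listener; allowed_namespaces models allowed routes matched by namespace name."""
    gateway_namespace: str
    allowed_namespaces: set


@dataclass
class Route:
    namespace: str          # where the route object lives
    hostname: str
    path: str
    backend_namespace: str  # where the referenced back end lives


@dataclass
class ReferenceGrant:
    """Created in the back end's namespace; names the namespace allowed to reference it."""
    namespace: str
    from_namespace: str


def route_can_attach(route, listener):
    # Gateway side of the handshake: the listener must list the route's namespace.
    return route.namespace in listener.allowed_namespaces


def backend_ref_allowed(route, grants):
    # Back end side of the handshake: a cross-namespace back end needs a grant
    # that lives in the back end's own namespace and names the route's namespace.
    if route.backend_namespace == route.namespace:
        return True
    return any(
        g.namespace == route.backend_namespace and g.from_namespace == route.namespace
        for g in grants
    )


shared = Listener("shared", {"tenant-b", "tenant-c"})

legit_b = Route("tenant-b", "tenantb.example.com", "/", "tenant-b")
malicious_c = Route("tenant-c", "tenantb.example.com", "/totally-legit", "tenant-c")
a_to_d = Route("tenant-a", "tenanta.example.com", "/", "tenant-d")

print("tenant-b route attaches to shared gateway:", route_can_attach(legit_b, shared))
print("tenant-c route attaches to shared gateway:", route_can_attach(malicious_c, shared))
print("tenant-a route attaches to shared gateway:", route_can_attach(a_to_d, shared))
print("A -> D back end without grant:", backend_ref_allowed(a_to_d, []))
print("A -> D back end with grant:",
      backend_ref_allowed(a_to_d, [ReferenceGrant("tenant-d", "tenant-a")]))
```

In this sketch the Tenant C route still attaches to the shared gateway, because the handshake checks which namespace a route lives in, not which tenant a hostname belongs to. That is the gap the malicious route in the demo relies on.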
What happens if I can then socially engineer a Tenant B user to go to this URL? Let's look at this in action. We are going to apply our malicious HTTP route, and then we're going to curl TenantB.example.com slash totally legit, and you will see we have been fooled. It's served up by Tenant C. So this is one example of how social engineering could be coupled with a malicious cross-namespace route, and we can codify this, as Torin said before, in an attack tree. So we're going to do that now.
Let's ignore the left-hand side of this slide for now and just focus on this attack tree, and let's just look at the top two lines. We want to route traffic to a malicious endpoint. This is a blue node, which in our code means it's an AND. It needs all of the nodes below it to be realized for it to be realizable. So we need to be able to create the malicious HTTP route, which we did in the previous step, and we also need that social engineering element.
Taking a break from the attack tree just for a moment to come back to the left. What we'll see is that there are different types of threats as we go on. We'll have some threats, such as all of the ones in this tree, which are completely generic to Gateway API. We'll also have some which are completely generic to container security, because we're running on Kubernetes. And there will be some which are Envoy Gateway implementation specific. This is where the importance of iterative threat modeling comes in, because we want to have those conversations with Arko and the maintainers, and they iterate the product. All of these demos are based on version 0.5 of Envoy Gateway, so we need to update them, and update the threat modeling report, post 0.6. What we'll see, actually, is that some Envoy Gateway specific things have improved in 0.6, which is massive credit to Arko and the team.
Back to the right-hand side then, looking underneath the malicious HTTP route.
We saw before that there are two ways to get this cross-namespace route. The left-hand side underneath the green node, which is an OR, is the single tenant example where we have the reference grant created. We won't spend too much time on that because, as you can see at the bottom, because of this handshake mechanism you basically need a malicious developer, a malicious admin, or compromised credentials to do something bad here.
The more interesting case is to the right, where we have the shared gateway, because now all we need is for the shared gateway to support a malicious namespace. Again, we could have a malicious admin, which isn't a very exciting attack. Another way we could do this, though, is if instead of matching on namespaces by name, as we had before, we match on a custom label of some kind. In this case, any malicious actor who can successfully label namespaces could potentially add a namespace to that list of supported namespaces.
So we've derived a control here, and this is how threat modeling works. We do an attack tree, we derive a control, and we try to cut off one of the branches. So here we are just very strict about configuring route bindings based on namespace name rather than any other custom label. The namespace name is always going to be there and it's going to be safe. So again, this comes back to this handshake mechanism protecting us from these risks being realized.
Let's look at a more nefarious attack here. If our attacker really wants to compromise one of our shared tenants, or the single tenant, compromising an Envoy proxy itself would be the end goal, because then they can tamper with traffic, sniff traffic, and compromise the availability of traffic delivered to back ends. This is the ultimate goal.
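Returning briefly to the first attack tree, its AND/OR logic and the effect of the derived control can be sketched in Python. This is a hedged illustration rather than the tree from the slides: node names paraphrase the talk, the `Node` class and the `realizable` and `build_tree` functions are hypothetical, and the leaf values are assumptions chosen to show how one control cuts off a branch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    kind: str = "leaf"                 # "and", "or", or "leaf"
    children: List["Node"] = field(default_factory=list)
    possible: bool = False             # only meaningful for leaves


def realizable(node):
    """AND nodes need every child to be realizable, OR nodes need at least one."""
    if node.kind == "leaf":
        return node.possible
    results = [realizable(child) for child in node.children]
    return all(results) if node.kind == "and" else any(results)


def build_tree(attacker_can_label_namespaces, gateway_matches_on_custom_label):
    # The labelling leaf is only live when the gateway selects namespaces by a
    # custom label AND the attacker is able to label namespaces.
    label_leaf = Node(
        "attacker labels their namespace to match the gateway selector",
        possible=attacker_can_label_namespaces and gateway_matches_on_custom_label,
    )
    # Assumption for this sketch: no malicious insider, no stolen credentials.
    malicious_admin = Node("malicious admin adds namespace", possible=False)
    insider_grant = Node(
        "reference grant created by malicious developer/admin or with compromised credentials",
        possible=False,
    )
    shared_gateway = Node(
        "shared gateway supports a malicious namespace", "or",
        [malicious_admin, label_leaf],
    )
    create_route = Node(
        "create malicious HTTP route", "or", [insider_grant, shared_gateway]
    )
    social = Node("socially engineer a Tenant B user", possible=True)
    return Node("route traffic to a malicious endpoint", "and", [create_route, social])


print("custom label selector:", realizable(build_tree(True, True)))
print("namespace name matching (control applied):", realizable(build_tree(True, False)))
```

Under these assumptions, switching from a custom label selector to matching on namespace name makes the labelling leaf unreachable, and because the remaining leaves are false the AND node at the top can no longer be realized.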
So let's have a look, not a very exciting look, but a very simple example of something we could do with a malicious Envoy proxy image deployed by Envoy Gateway, and we'll look at how this is done by modifying a gateway class and a custom Envoy proxy resource.
First of all, let's look at a really silly toy container. All we do is, on top of the Envoy Gateway base image, install tcpdump and a script which just listens for incoming HTTP connections. Not very exciting. We'll build this first of all, and then we will load it into our kind cluster so that we can use it in our manifests. Once we have done this, which will take a minute, we can set up a new tenant. If there weren't enough with A to D already, we're going to install Tenant E and give them their own Envoy Gateway controller and their own namespace. We're just doing this for example purposes, to show how you can create a custom gateway class and a custom Envoy proxy.
Then we look at some of the infra we'll deploy for Tenant E. At the top, we give them their own gateway class, and we're going to reference out to a custom Envoy proxy config, which is at the bottom. This proxy config uses our custom, malicious, silly toy image.
So let's apply this now and see what an attacker could see. We all know what the answer is going to be. It's just all the incoming HTTP traffic, but let's show it. We'll exec into the Envoy proxy pod deployed for Tenant E and run our tcpdump script. We'll probably see some random health check traffic and stuff, and you've got to imagine me furiously opening another window, trying to curl Tenant E in the background. Eventually it comes through, but I was obviously quite slow at the time. And there it is, there is our echo response from Tenant E's namespace.
So obviously a very silly example. In a real-life scenario, an attacker is much more likely to want to tamper with traffic, or farm off traffic to an attacker-controlled service without being noticed. But again, it's just a silly example to show that compromising the Envoy proxy is a good end goal for an attacker.
So how would we do this? It's back to the attack trees. Our second attack tree has this as our top node: compromising an Envoy proxy. It's a green OR node because there are many ways to do this. We could maliciously craft an image. We could do this through a supply chain attack, either in the Envoy Gateway or Envoy Proxy projects themselves, or we could do a dependency confusion attack or typosquatting, something like that, to trick someone into installing a malicious image. The one that we looked at on the previous slide is the second node here: gateway class references malicious image. We'll come back to that in the next slide, because we're not done with this tree just yet. Remote code execution within the proxy is obviously a possibility if we have a suitable vulnerability and network access. Gaining a shell in a running proxy container is obviously possible if the attacker has Kubernetes API access and the container has a shell. And of course we're running on Kubernetes, so we have generic pivoting risks as well, if there's some weak Kubernetes configuration in play.
We'll do the same thing again. We'll leave this one to the next slide, but we'll look at some example controls for these other ones. For supply chain attacks, the Envoy Gateway project can help us here by doing things like aligning with SLSA standards and signing images so that people can verify that they're running trusted images, and production of SBOM and VEX can help with vulnerability triage processes. Then, vulnerable Envoy proxy images themselves. We of course don't have to use that customized Envoy proxy resource for evil. We can do it for good.
So we can pin versions to minor patch releases, again to help that triage process and to make sure that we're mitigating any identified vulnerabilities. Then, proxy containers having a shell. Obviously we can use distroless base images. It's worth saying, again, that the demo was based on version 0.5, and 0.6 uses distroless base images. Again, it's that iterative process of talking with developers and a change being made. Our end goal here really is to have all of the threats either remediated by default or requiring only a very, very simple end user implementation. So the fewer threats that we get over time as the project evolves, the better. We're still going to capture those in the report. It's just that the recommendation will change from being something like "you have to do this very specific thing" to "well, just run Envoy Gateway 0.6 or higher".
And finally, weak Kubernetes security. We're all here for a reason. We know a bit about this. So validating admission control, AppArmor, seccomp, limiting Linux capabilities, things like that. We can also consider maybe a hardened runtime for proxy and gateway pods. Something like gVisor could help with this pivoting risk.
So back to our interesting example then, which is this one circled in yellow: gateway class references malicious image. The example we saw in the previous demo wasn't very exciting because, again, we just created the resources. It would be much more interesting to see what sort of access we would have to have. What would have to be compromised in our cluster in order to tamper with those resources?
So let's look at the example of a malicious gateway class being created. For this example, we're going to imagine, in our diagram from the start, that Tenant A has been compromised. More specifically, their Envoy Gateway controller has been compromised. What could the attacker do if they've compromised the Tenant A namespace and that gateway controller? So it's the third and last demo of the day.
Let's have a look. Version 0.5 uses a cluster role for Envoy Gateway. So let's see what that cluster role can do to Envoy proxies. Note that we can't patch Envoy proxies. So just patching the proxy image used by the shared gateway, for example, to get a malicious gateway for the shared tenants, is not going to work. However, let's look at what we can do with gateway classes. And we will see that we can patch gateway classes.
So let's see if we can create something like this. We've got a malicious Envoy proxy. It's not a stretch to imagine. We've said Tenant A has been compromised, so maybe we have a malicious custom proxy config in the Tenant A namespace. Now, can we patch the shared gateway class to use that config? What we really want to do is create this malicious gateway class from our compromised controller in Tenant A, but we're actually patching the gateway class for Tenant C or B, or anyone who uses the shared gateway.
So what we'll do is just apply our malicious Envoy proxy resource, because we're saying Tenant A has been compromised. That's in the Tenant A namespace. Now, we've got this little script in our repo which will perform a kubectl action as the compromised Tenant A Envoy Gateway controller. So we'll patch that gateway class resource and restart our pods in the shared namespace. Then we'll just show that the pods have been restarted recently. And now let's grep out the image used by our shared Envoy proxy. We can see that it is not our tcpdump image, so it hasn't worked. Again, this is a mitigation built into Gateway API, which I'll come back to in just one second.
But in 0.5, we still have a cluster role which... that was very quick at the end of the demo, apologies for that. Basically, you could enumerate secrets in different namespaces.
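The permission checks in this demo can be sketched as a tiny rule matcher in Python. This is an illustration only: the rules below are hypothetical and simply encode what the talk reports for the 0.5 cluster role (gateway classes can be patched, Envoy proxies cannot, secrets can be enumerated across namespaces). They are not copied from the Envoy Gateway manifests, and `can_i` is a made-up helper, loosely similar in spirit to asking `kubectl auth can-i`.

```python
# Hypothetical cluster-wide rules, assumed for illustration only.
# Each rule grants a set of verbs on a set of resource types.
CLUSTER_ROLE_RULES = [
    {"resources": {"gatewayclasses"}, "verbs": {"get", "list", "watch", "patch", "update"}},
    {"resources": {"envoyproxies"}, "verbs": {"get", "list", "watch"}},
    {"resources": {"secrets"}, "verbs": {"get", "list", "watch"}},
]


def can_i(verb, resource, rules=CLUSTER_ROLE_RULES):
    """Return True if any rule grants the verb on the resource type."""
    return any(resource in rule["resources"] and verb in rule["verbs"] for rule in rules)


# The three questions asked in the demo, from the compromised controller's point of view.
checks = [
    ("patch", "envoyproxies"),    # swap the shared proxy image directly
    ("patch", "gatewayclasses"),  # point the shared gateway class at a malicious config
    ("list", "secrets"),          # enumerate secrets in other namespaces
]

for verb, resource in checks:
    answer = "yes" if can_i(verb, resource) else "no"
    print(f"can the compromised controller {verb} {resource}? {answer}")
```

Because a cluster role is not tied to one namespace, every "yes" here applies across all tenants, which is why a permissive cluster role shows up as a threat in its own right even though the gateway class patch did not take effect in the demo.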
So just having a cluster role which is that permissive is a risk, and I think it was the right move to go to a different model in 0.6, which we'll see has been done later on.
Just going back to why that didn't work. We patched the gateway class. The reason this doesn't work is that in Gateway API, the state of a gateway is bound to the state of the gateway class at startup time. What this does is reduce blast radius. We can't patch a gateway class and have it affect all of the running gateways underneath that gateway class. We would have to actually create new gateways, and that would be complicated, because you'd have to trick people into using those gateways for their routes and stuff like that. So this massively reduces blast radius and is another example of Gateway API, and any implementation which does it well, such as Envoy Gateway, helping us out here. This helps our threat model massively.
We brought up a threat here, which was the cluster role, and maintaining awareness of cluster roles is obviously key. I realize we're almost out of time, aren't we? But we'll finish up quite quickly. Gateway classes will always be cluster scoped but, like I said before, Arko and the team have been really, really engaging during this process, and 0.6 does not require these cluster roles. So already we're getting built-in mitigations against some of our threats.
Business context has to be added. Normally you would do this when you do a data dictionary, and it depends on the data held by your client. It's a bit hard to do this, and hence to prioritize threats, in a very generic case where anyone could be using this project. So what we've done is a very generic impact and likelihood rating, just to prioritize threats. We've got some other examples that we can go through.
I don't think there's much value in going through specific examples right now, because we'll push the report and you can read through these in detail. But there are things like good PKI and key management, and there are really good docs on the Envoy Gateway documentation site regarding the use of cert-manager and things like that.
The threat at the bottom, misconfiguration of dynamic proxy config, is really what we've been talking about with those attack trees throughout the whole talk. What we have to do is basically adopt those RBAC models shown by Torin at the start, either the three-tier or the four-tier, whichever is more relevant to you. Maybe consider a GitOps model for dynamic config, and a secure baseline: instead of using this shared model, maybe a single gateway class per cluster, an Envoy Gateway controller per tenant, tenants with similar data sensitivities sharing, things like that. Soft multi-tenancy assumptions are, I think, a safe thing.
And Matt is looking at his watch. So please check out Envoy Gateway 0.6, and stay tuned for the threat model report, which we'll push to the repo. It's currently in draft and we're updating it, and once it gets pushed, we really, really would appreciate comments, feedback, and collaboration. So thank you very much.