Hi, my name is Amim Knabben. Today I'm going to speak about the security evolution of gRPC inside the mesh, particularly inside the Istio service mesh. I am a software engineer with a focus on cloud native and free and open source software. Today I work at VMware on Tanzu Kubernetes Grid, acting as a security and Windows tech lead. I have been contributing to the Kubernetes project since 2020, mainly inside the SIG Network and SIG Windows subprojects. My main interests are operating systems, computer networks and distributed systems.

For the agenda, we start by taking a look at the mesh and zero trust security, how they fit together and what the relationship between them is. Next, we pass through SPIFFE and the service identity problem inside distributed workloads. We look at the SPIFFE specification details and the SPIRE implementation. On the Istio service mesh side, we take a look at east-west traffic specifically and focus on gRPC, looking at the evolution of the protocol inside the project. Starting with gRPC proxyless, we are going to deep dive and take a look at how it works. Next, we deep dive into the ambient mesh: ztunnel, waypoint and all the layer 4 and layer 7 capabilities related to this. In the end, we finish with the conclusion and the wrap-up of this presentation.

Starting with the introduction: what is the mesh, and how does it relate to zero trust security? If you can provide security features that use well battle-tested technology stacks, you can mitigate internal and external threats against your data, endpoints, communications and the platform. The service mesh can give you ownership of the authentication and authorization layer, decoupling everything from the development side and giving you more flexibility for managing the entire ecosystem. On top of that, Istio provides protection against non-repudiation attacks as well.
Besides that, it has some strong gRPC capabilities, compliant with zero trust security. But what is zero trust security nowadays? NIST has a specification called SP 800-207, where it says the zero trust model assumes that an attacker is present in the enterprise environment. So any workload, independent of the environment, is not trustable unless proven. So we enforce the principle of least privilege for networks and applications. Zero trust security is not a single architecture, but a set of guiding principles for workflow, system design and operations that can be used to improve the security posture of any classification or sensitivity level.

Neither zero trust nor the service mesh are silver bullets; let's be very clear here. When you talk about zero trust security, you are not talking about a final architecture or a final hard design for a system, but rather a set of principles that, combined, can bring you a more reliable architecture for systems. From the principles detailed in this document, we have the fourth principle, which is very interesting. It says: access to resources is determined by dynamic policy, and may include other behavioral and environmental attributes. Meaning, we can define dynamic policies for your cluster or your system that include behavior and attributes of the environment. How do you define these attributes? They are defined by policies, and a policy has four traits. The subject is the first trait, and it indicates the entity performing the action. The second trait is the action: the action being performed by the subject. Third is the target: the object the action is being performed on. And the fourth is the condition: for the policy to be applied, a condition needs to be met. Another interesting principle for zero trust is that the enterprise monitors and measures the integrity and security posture of all owned and associated assets.
So Istio can provide telemetry and logging, enabled by default, for all the services inside the mesh. We are going to see more details, but that's the beginning of how Istio and zero trust start to make sense together, and how the implementation of zero trust principles sits at the core of Istio's design. It's important to note that zero trust relies on auxiliary components as well. So you have other components that complete your entire architecture: they can be enterprise public key infrastructures, SIEM systems, threat intelligence feeds, network and system activity logs, et cetera.

Zero trust 101: we now understand a few of the principles that, combined, create this design and architecture. Next, we are going to see the components that make up the concept of zero trust. The first one is called the policy decision point, or PDP. This is the core, where all the policy determinations are made. The second one is the policy enforcement point, or PEP. This is the edge, or the data plane, as we call it inside Istio. The PEP is responsible for enabling, monitoring and managing the connection and communication between the enterprise resource and the subject that is accessing the data plane or the services. Following this diagram, we can see clearly that the PDP, the policy decision point, is defined by the control plane. So istiod works as the PDP in this model, where you have the policy engine and the policy administrator that allows you to create policies dynamically. At the same time, if you look at the data plane, the proxy, the Envoy sidecar in the traditional model, works as the PEP or enforcement point, where xDS enables us to apply the policies coming from the PDP side. So now we need a way to identify our workloads inside this environment. The workload needs a specific identifier for us to trust and to build our policies upon.
And the SPIFFE protocol was created exactly to provide us a common way to do that. SPIRE, as we are going to see, is the implementation of the SPIFFE protocol. SPIFFE stands for Secure Production Identity Framework for Everyone. It is a set of open source standards for securely identifying software systems in dynamic and heterogeneous environments. The SPIFFE ID is the way you identify your workloads. It starts with a trust domain and has a workload identifier inside of it. The workload identifier on Istio and Kubernetes can be the service account of your workload. The other important concept here is the SVID, or SPIFFE Verifiable Identity Document. It supports two methods of communication. The first one is the X.509-SVID, which uses certificates for east-west traffic. The second one is the JWT-SVID, which is used for north-south traffic communication. The good news is that Envoy and Istio have an implementation of the SPIFFE spec inside of themselves, and you can use it transparently for mTLS communication across your workloads.

So, talking about the SPIRE implementation: SPIRE implements the SPIFFE API and performs both node and workload attestation, and this is a critical part of the entire protocol. It issues SVIDs to the workloads after they are attested inside your SPIRE deployment. The good part is that the server and the agent are very dynamic, and they have plugins where you can extend and add your own logic to validate these kinds of attestations. The way it works is that first the node attests itself: the agent is required to authenticate and verify itself when connecting to the server, in our case the SPIRE server API. Your node will have workloads inside of it, and as the second phase, the workloads need to attest themselves as well.
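To make the SPIFFE ID format mentioned above concrete: in Istio on Kubernetes, the ID is built from the trust domain plus the workload's namespace and service account. The trust domain and names below are illustrative, not taken from the talk.

```
spiffe://cluster.local/ns/default/sa/grpc-a
         \___________/ \_________________/
          trust domain   workload identifier
          (the cluster)  (namespace + service account)
```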
So you can use a few methods of attestation for workloads, but basically attestation answers who the process is that is authenticating and identifying itself inside your zero trust domain. In this example here, we have Envoy with an OPA policy. This is a practical example from the SPIFFE website; you can find all the code to run it there. Here we are going to analyze the diagram and understand better how this is implemented. In the example we have two front-ends, one on each side, and behind them another server called back-end. If you look at the communication between the front-end and the back-end, it uses mTLS, meaning mutual TLS authentication between both sides when the connection happens. As you can see, the Envoy proxy plays a big part in this, and here we can get a hint of how mTLS authentication started to be built inside the service mesh. Envoy allows you to connect to another Envoy and have this communication over mTLS. Still, you need your SPIRE agent running on each one of the nodes, and it will use SDS to deliver the workload certificate and the CA certificates to Envoy. So the agent, once attested, communicates and brings the certificates to Envoy, which can transparently communicate and authenticate across your zero trust mesh environment.

In the end, we can see this example of authorization. After you have authentication through mTLS, you can authorize your workload to access, or not. Basically, in this example, you can use OPA and its policy description language to validate this information. The default is that everything is false in the allow path, and then everything that comes with the method POST, coming from this SPIFFE ID, in this case the cluster.local trust domain, the default namespace and the service account grpc-a, is allowed.
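As a rough sketch of the SDS wiring described above, these are two fragments an Envoy bootstrap could contain so that certificates come from the local SPIRE agent instead of files on disk. The socket path, SPIFFE ID and cluster name are assumptions for illustration, based on the documented SPIFFE/Envoy integration, not taken from the talk.

```yaml
# Fragment 1: a cluster pointing Envoy at the local SPIRE agent's
# Workload API over a Unix domain socket (path is an assumption).
clusters:
- name: spire_agent
  connect_timeout: 1s
  http2_protocol_options: {}
  load_assignment:
    cluster_name: spire_agent
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            pipe:
              path: /run/spire/sockets/agent.sock

# Fragment 2: in the listener's TLS context, fetch the workload's
# X.509-SVID over SDS from that cluster instead of from disk.
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      tls_certificate_sds_secret_configs:
      - name: "spiffe://cluster.local/ns/default/sa/backend"
        sds_config:
          api_config_source:
            api_type: GRPC
            transport_api_version: V3
            grpc_services:
            - envoy_grpc:
                cluster_name: spire_agent
```

The point of the design is that Envoy never touches key material on disk; the SPIRE agent rotates the SVID and pushes updates over the SDS stream.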
So let's talk about the Istio service mesh and the gRPC connection in the east-west overview, how this can fit the zero trust security model as we saw in the beginning, and let's take a look at the history of how gRPC has been treated inside the project. gRPC proxyless was the first idea for removing Envoy as the sidecar of your application. The gRPC library can connect directly to istiod via its xDS support. So your application has the capability to connect to istiod directly, without any intermediary. It still needs a bootstrap file with some settings to inform how your application will connect to istiod, and you still have the istio-agent sidecar, but you don't have any proxy in the data path. As you can see in the picture, the istio-agent provides you the certificates and the gRPC xDS bootstrap for the control plane connection. The cool part of this is that the footprint of resources used is super low: all the logic resides in your application, and you don't depend on any external path to get the information you need to bootstrap your application.

The second way, and we are going to see a demo of it shortly, is the ambient mesh. Ambient mesh was another idea from the Istio team to remove the sidecar entirely. The ambient mesh is split into two components, basically: the first is the layer 4 component and the second is the layer 7 component. The layer 7 component is still an Envoy proxy, what they call the waypoint proxy, but it's not required for zero trust and mTLS connections; those happen at the ztunnel level. ztunnel is a Rust project that was created exactly to provide only this capability of overlay and encryption between the nodes and workloads. As you can see, istiod still works as the PDP, and in this model the ztunnel or the waypoint are your PEP. The ztunnel is a DaemonSet, and the waypoint is a deployment that doesn't necessarily run on each one of the nodes of your cluster.
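Going back to the proxyless bootstrap file mentioned above, here is a sketch of what the JSON the gRPC xDS client reads (usually pointed at by the GRPC_XDS_BOOTSTRAP environment variable) could look like. The istiod address, port and node ID are illustrative assumptions; in practice the istio-agent generates the real file for you.

```json
{
  "xds_servers": [
    {
      "server_uri": "istiod.istio-system.svc:15010",
      "channel_creds": [{ "type": "insecure" }],
      "server_features": ["xds_v3"]
    }
  ],
  "node": {
    "id": "sidecar~10.0.0.5~grpc-a-0.default~default.svc.cluster.local",
    "metadata": {
      "GENERATOR": "grpc"
    }
  }
}
```

With this file in place, the application dials an `xds:///` target such as `xds:///grpc-b.default.svc.cluster.local:9000` instead of a plain DNS name, and the gRPC library resolves endpoints, load balancing and security config from istiod directly.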
In the example diagram here, just to illustrate, we have another DaemonSet called istio-cni that sets up the iptables rules and forces the workloads' traffic through ztunnel. ztunnel detects whether the other workload inside the mesh has a waypoint or not, and forwards the traffic to the waypoint if it exists; the waypoint can be created per service account or for your entire namespace, decreasing the amount of overhead that the sidecar had in the traditional model. After the waypoint receives the traffic from gRPC A, it forwards the traffic to the ztunnel on the destination workload's node, and that ztunnel, through the iptables rules, can route it to the correct workload.

To illustrate authentication and authorization in Istio and how these dynamic policies work: on the left side we have the authentication. The authentication is pretty simple; it means you have mTLS enabled, or not, for all the workloads. If you have it, the SPIFFE implementation of Istio and ztunnel will take care of the authentication and certificate generation for your workload communications. The more interesting part, as we are going to see in the demo, is the authorization policy that can be created as a CRD in your Istio service mesh. The first thing we can see here is the selector for this authorization policy, the target. You can set this to be applied on a specific workload; in our case we are saying, OK, apply it on the waypoint installed inside your cluster. Next you have the action; it can be deny or allow, and in our case it is allow: this authorization policy will allow the request if the conditions match. Then you have the subject, in the from field: it filters for this particular gRPC A, saying all the traffic that is coming from this SPIFFE ID in particular can be accepted. And finally we have the condition.
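Putting those four parts together, here is a sketch of how such an AuthorizationPolicy CRD could look for a setup like the demo's. The namespace, names, trust domain and method path are illustrative guesses matching the talk's examples, not copied from the demo manifests.

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-grpc-a
  namespace: default
spec:
  # Target: attach the policy to the waypoint proxy
  # (the waypoint name is an assumption for this sketch).
  targetRefs:
  - kind: Gateway
    group: gateway.networking.k8s.io
    name: waypoint
  # Action: allow the request when a rule matches;
  # anything not matched by an ALLOW rule is denied.
  action: ALLOW
  rules:
  - from:
    # Subject: only traffic carrying gRPC A's SPIFFE identity.
    - source:
        principals:
        - cluster.local/ns/default/sa/grpc-a
    to:
    # Condition: gRPC calls are HTTP/2 POSTs, and each gRPC method
    # maps to an HTTP path, so we can restrict to a single method.
    - operation:
        methods: ["POST"]
        paths: ["/ping.PingService/DummyServerStream"]
```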
So for this condition we have the operation, and within the operation we have paths. The only operation in gRPC is POST: gRPC runs on top of HTTP/2 and uses POST as its default method of operation. And we can filter by paths, both by service and by method inside your gRPC service; as we saw in the example, DummyServerStream is the method we are filtering on.

Let's watch the demo. Locally, we are going to install the Istio service mesh. The first thing we do is run istioctl install with the ambient profile in the settings. Now kustomize is being used to install the Gateway API CRD that is required for running Istio ambient and the waypoint component; besides that, a few other components like Kiali and Grafana are being installed. On the namespace we have the label istio.io/dataplane-mode=ambient, meaning all the workloads running in this namespace are going to be in the mesh. Now we can use k9s to take a look at the workloads that are running, and at the same time we are going to deploy our application. Behind the scenes, the spec has two gRPC workloads, a client and a server, as we are going to see in the following sections. If we take a look at this spec, we have a service account, a service on port 9000, and a deployment object pointing to the latest image. Going into one of the pods, we run grpcurl against the gRPC B service. So we join the gRPC A pod and curl the gRPC B service; looking at the ping grpcurl script, we can see we are sending a payload to the service on port 9000, and the method is DummyServerStream.

Then there are the policies we are going to create for authorization on the server. The first authorization policy we have will apply to the Istio waypoint, with this label here, and we are allowing everything that comes from the gRPC A SPIFFE ID, pointing to the service account in this namespace. We have the path of the DummyServerStream method, and we have the server reflection paths that are used by the waypoint as well. With this we apply the policy. Now we can go into gRPC B and try to grpcurl gRPC A in the opposite direction, and this tells us access denied, RBAC, because we only have an authorization policy going from gRPC A to gRPC B and not the opposite. If we want that, we need to create another authorization policy. From gRPC A, trying the request to gRPC B works as expected. Next I can take this authorization policy and say, OK, my method is not DummyServerStream anymore, it is the real server stream method, so DummyServerStream is not allowed anymore. And if I block it, say on a production server, the stream that is already running is not blocked automatically, but when I try to run it again it says access denied, you cannot access this anymore, because this method is not allowed. Then you can come back and roll back the authorization policy to allow this request: I change the policy back to DummyServerStream, try to curl again, and everything is working again.

So let's wrap up and have a conclusion for this presentation. Here are a few interesting tests that I have been conducting on a local cluster, so don't take them too seriously; they are only to measure a few numbers that can be compared. The first one is the traditional sidecar. As you can see, we have 5000 connections and a thousand connections per second for all the tests, and it took almost 1 to 2 milliseconds, with an average of 0.88 milliseconds, which is not bad. The second example is using the L4 ztunnel. The L4 ztunnel has a better outcome: 300 microseconds for most of the connections and an average of 180 microseconds of response. And the winner is gRPC proxyless, with an average of 108 microseconds, although most of the requests came in around 300 microseconds. The example application was the Istio testing app, and the load testing project is ghz. You can replicate these kinds of tests, it's pretty simple, and you are invited to do this on your real workloads and real cluster before running these things in production.

All right, so what's the right choice for me, or how can I pick one of them? Proxyless gRPC with Istio is still reduced in functionality. It has low resource consumption, as we saw in the examples, but it still needs more integrations implemented on its roadmap. It is decoupled from the control plane but coupled to your code, so if you plan to run your gRPC services fully inside the mesh, there is a trade-off: you implement everything inside your code. The ambient mesh provides a solution without the sidecar; it has authorization policies, and it has more complex features if you use the waypoint. It adds a bit of latency when you add the Envoy, but it still gives you the entire service mesh, and it can integrate with the sidecar service mesh. It is in full development and going to beta on the roadmap soon. It is decoupled from your code and follows the same principles as Istio in its design, but it is still coupled to the data plane.

All right, a shout-out to the folks that helped create the content, and thanks. If you have any questions, feel free to reach out to me. Thanks, and see you next time.