Hi, everyone. I'm Sanjay Pujare, a staff software engineer in Google Cloud and part of the gRPC team. I lead the PSM security offering, PSM being Proxyless Service Mesh. In this talk, I'm going to cover a gRPC Proxyless Service Mesh intro and summary. I'll also talk about why we need security in this service mesh and why it is so painful today. I'll talk about how we added security to the gRPC Proxyless Service Mesh and how it works. I'll do a technical drill-down with a deployment diagram. Then I'll talk about what changes we made to the gRPC library, specifically the gRPC programming API for using this feature. Later on, I'll describe a PSM security deployment and what it looks like in Google Cloud as an example. Towards the end, I'll cover the future roadmap. I also have a slide that has some links to resources for PSM in case you're interested. And at the end, we'll have a few minutes for Q&A.

Now, I'll give a brief intro to service mesh. So, service mesh with proxies was the initial model. This is how the service mesh started. In this, transparent proxies in red boxes allow existing applications to be in a service mesh. The applications shown in blue boxes are not aware of the service mesh. The proxies implement these... Is it me? Okay, cool. So, where was I? Okay, so the service mesh includes features like service discovery, routing, load balancing, security, and observability.

Now, gRPC is a popular framework for service-to-service communication, as Randy just mentioned. And so, we were thinking, can we add service mesh features or awareness of the service mesh to gRPC? So, in this case, applications would still be somewhat unaware of the service mesh around them, but the service mesh policies would be transparently applied to the gRPC traffic by the gRPC library.

Now, let me talk about xDS, because xDS is frequently mentioned in the context of service mesh. xDS is a protocol for the control plane to talk to data plane entities.
Hence, it's called a data plane API. In xDS, the x stands for anything, like an unknown quantity in equations, and DS is discovery service, so like discovery of clusters, routes, listeners, et cetera. xDS was developed for Envoy, but it is pretty open and extensible for any kind of service mesh. So, gRPC adopted it and extended it for the Proxyless Service Mesh.

So, when the pandemic was raging, we were enhancing the Proxyless Service Mesh with gRPC. Our first release last year in June had basic load balancing and service discovery. Over the past year, we added various advanced traffic management features like traffic splitting, circuit breaking, and session affinity.

So, I'm kind of curious, in this audience, how many of you have used a service mesh with a proxy? Maybe a show of hands? Okay, cool. And how many of you know about the gRPC Proxyless Service Mesh? How many of you have heard about it? Okay, not that many. Okay.

Now, before I dive into security: so far I've talked about just Proxyless Service Mesh without security. So, just to recap the current status of Proxyless Service Mesh, there was a previous KubeCon presentation by our product manager, Megan, in May earlier this year, that talks about the gRPC Proxyless Service Mesh and what we had until then. You know, basically the load balancing plus advanced traffic management features. So, I've included links to the slide set and the video recording here. On Google's Traffic Director site, there are a couple of blogs that talk about this. Now, Traffic Director is Google's implementation of the xDS control plane. The Traffic Director user guide covers Proxyless Service Mesh, and you can actually get a test account, I believe, with some free credits on Google Cloud and do a test drive of the Proxyless Service Mesh.

Now, before I go into Proxyless Service Mesh security, I wanted to summarize why security is so important in a service mesh, right?
Now, you can think of a service mesh as the equivalent of breaking up a monolithic application, and what used to be in-process communication inside the monolith is now RPCs over the network. As a result, these RPCs need to be secure. Now, these RPCs are routed and load balanced as part of the service mesh orchestration. So we need security that's well integrated with other aspects of the mesh, you know, things like routing, load balancing, and service discovery. As an example, each endpoint needs to be able to validate its peer's certificate and identity using the information supplied by the control plane.

Now, in this picture, the yellow layer talks to the red layer and the red layer talks to the green layer. And these RPCs cross network and infrastructure boundaries. The cloud-like lines show the various boundaries the RPCs have to cross. And the routing across these boundaries is orchestrated by the control plane. So we need a compatible security configuration from the control plane for the endpoints to be able to authenticate and authorize their peers. Once that is done, traffic is secure.

Service identities are the foundation of the service mesh, and these are verified through mTLS in a service mesh. Server authorization and client authorization, also known as RBAC, at least in our gRPC ecosystem, depend on service identities. Now, all of these things ultimately enable a secure service mesh.

There is another point I want to make in passing. For really effective security, we also need certificates and keys to be rotated. Do we want to burden the application developers with the overhead of certificate and key management for doing that? No, so ideally we would like the infrastructure and the framework together to do these things for us. And that's where PSM security comes in.
One more thing I wanted to mention is that if you really want to do mTLS yourself in your application, it's a huge pain today because you have to do the certificate management yourself. First of all, you have to modify the code to load certificates and use them to create a TLS or mTLS connection. Then you also have to add code to perform additional verification as per your semantics. For example, only certificates with certain identities are allowed for certain services. And it's also a headache for folks deploying and configuring your software. They need to manage things like putting certificates in the respective directories, the configuration files needed to point to those directories, and of course the code or the application that is going to make use of those certificates. Another pain point is what happens when the certificates expire and need to be replaced. In many cases, you actually have to restart your application.

So when it comes to Proxyless Service Mesh with gRPC, we have seen the advanced traffic management features I talked about. But what about securing this traffic automatically? And that is where PSM security comes in. It gives you service-to-service security. Now, this is transport security, which is mTLS for xDS-managed gRPC connections. Just a refresher on mTLS: it gives you authentication, encryption, and server authorization. I'll talk about server authorization in a bit.

Now, how does it all work? All right. Note that gRPC is an RPC framework in a library. It cannot do everything. We need infrastructure to provide us with certificates and keys and a way for gRPC to consume them. The xDS control plane is needed to orchestrate and secure connections between the workloads by providing the security configuration.
gRPC is the glue that combines the infrastructure-provided certificates and keys with the security configuration provided by the xDS control plane and, putting it all together, makes the traffic secure.

So, another question to the audience. How many of you were missing security in the service mesh and think that this would be a cool feature, something that you thought might be interesting? So, okay.

Now, this is a pictorial representation of how it all works. I have the client in the yellow box on the left sending RPCs to the server in the red box over a secure channel. The blue lock on that line indicates a secure channel. Both the client and server are xDS-enabled and get their security configuration from the xDS control plane shown in blue at the top. The client and server need certificates and keys that are provided by the underlying infrastructure to make it all happen. The green box represents the infrastructure components, which include the certification authorities, or CAs, to mint the certificates, a process to continuously generate CSRs and use them to request certificates, and a mechanism to make these certificates and keys available to the gRPC workload using gRPC's plug-in feature. So when all these things are put together, the client and server are able to secure their gRPC traffic. In a bit, I'll show an actual implementation of this in Google Cloud.

Some more technical details for those of you who are familiar with xDS and specifically how xDS works in gRPC. I'm assuming some familiarity with the existing xDS flow in proxyless gRPC. I have three columns here: one for the client on the left, one for the control plane in the middle, and one for the server on the right. This shows the configuration steps from the control plane. Here, the client is trying to reach something like payment.service and uses xds:///payment.service as the target string. So xDS resolves it.
What I mean by xDS is the xDS code in gRPC plus the xDS control plane. Collectively, they resolve it and the client receives CDS and EDS. These are the xDS messages that help the client set up connections to the backend instances of the service. One instance is shown here with the IP address 10.3.19.7 and a certificate ID of mesh1/cluster2/SRV3. I'll talk about the spiffe: prefix later.

Over here on the right, the server in the red box receives LDS, which is a listener discovery service message, to help it set up the listener socket for the service. In this instance, the backend with IP 10.3.19.7 receives the config to set up a TLS listener and to bind it to port 8000. When the server receives a client connection, the server applies what is called a filter chain from xDS, and that filter chain contains the required TLS configuration for the transport socket. The idea of a filter chain is that you can have different security configurations for different clients. So the filter chain has a filter chain matcher, and based on that matching, the server is able to use different security configurations for, let's say, different IP addresses, different networks and all that stuff. So the filter chain has config to use the certificate with a cert ID of mesh1/cluster2/SRV3 (this is just a made-up ID) for the current client connection.

So as you see, both the client and the server receive the required configuration to set up mTLS connections between them. This configuration includes things like the certificate and keys, whether it should be a TLS mode or an mTLS mode connection, and server authorization information. Again, like I said earlier, I'll talk about server authorization in a bit.

So the design and implementation details of the work we have done are as follows. We created a gRFC. A gRFC is an RFC or a spec in the gRPC ecosystem to describe the design, and I have provided the link here.
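To make the filter chain idea above concrete, here is a minimal sketch in Python of picking a security configuration by client source address. It is not gRPC's or Envoy's implementation: the `FilterChain` class, its fields, and the "most specific source prefix wins" rule are simplifying assumptions for illustration, and real xDS filter chain matching considers more criteria than the source IP.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FilterChain:
    """Hypothetical, simplified stand-in for an xDS filter chain."""
    name: str
    source_prefixes: List[str]   # empty list means "match any source"
    cert_instance: str           # certificate instance to use for the handshake
    require_client_cert: bool    # True for mTLS, False for plain TLS


def pick_filter_chain(chains: List[FilterChain], source_ip: str) -> Optional[FilterChain]:
    """Return the chain whose source prefix most specifically matches source_ip."""
    addr = ipaddress.ip_address(source_ip)
    best: Optional[FilterChain] = None
    best_len = -1
    for chain in chains:
        if not chain.source_prefixes:
            # A chain without prefixes acts as a catch-all with the lowest priority.
            candidate_len = 0
        else:
            matching = [
                net.prefixlen
                for net in (ipaddress.ip_network(p) for p in chain.source_prefixes)
                if addr in net
            ]
            if not matching:
                continue
            candidate_len = max(matching)
        if candidate_len > best_len:
            best, best_len = chain, candidate_len
    return best


chains = [
    FilterChain("default", [], "default_cert", require_client_cert=False),
    FilterChain("internal", ["10.0.0.0/8"], "mesh_cert", require_client_cert=True),
]
chosen = pick_filter_chain(chains, "10.3.19.7")
print(chosen.name if chosen else "no match")
```

With these made-up chains, a client connecting from inside 10.0.0.0/8 gets the mTLS configuration, and any other client falls through to the catch-all chain.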
The gRFC covers the programming API, which I'll cover in a bit, the implementation of the security flow, and something new. What is that something new? That something new is the certificate provider plug-in framework, which is used by gRPC to get the required certificates and keys. The stuff described in the gRFC is implemented in gRPC Go, C++, Java, and Python. So you can try PSM security, because we had a public preview back in May for all four languages. I'll talk about that later.

Now some more details about the certificate provider plug-in. gRPC was the first to propose the certificate provider plug-in framework in the xDS ecosystem. This is an extensible framework that enables various mechanisms, including custom ones, to get certificates. These extensions or plug-ins are loaded and configured locally using a bootstrap file. The xDS control plane just references an instance of a configured plug-in. So as an example, the xDS control plane might reference an instance called XYZ, which is looked up in the bootstrap file. gRPC expands XYZ to a plug-in name and configuration, and if necessary, it will load the plug-in. It will use this instantiated plug-in to get certificates and keys for a channel or a server. gRPC also makes sure that a pipeline is properly set up between the plug-in and the consuming channel or server. The pipeline is dynamic, which means that when certificates and keys are updated, the updates are immediately reflected in the channel or the server.

Now, this simplified interface allows both Envoy and gRPC to use it. This is made possible by the indirection provided between the implementation and the xDS protocol. It abstracts out the certificate provider implementation in gRPC. We currently have a plug-in called FileWatcher that watches certificates and keys in the file system. Some more stuff.
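The instance lookup and the file-watching behavior described above can be sketched in Python as follows. This is an illustrative model, not gRPC's code: the `resolve_instance` function and `FileWatcher` class are hypothetical names, the file paths are placeholders, and the bootstrap field names follow my reading of the gRFC and should be checked against it before use.

```python
import json
from typing import Callable, Dict, Optional, Tuple

# Bootstrap fragment with one configured plug-in instance named "xyz".
# Paths are placeholders; field names are assumed from the gRFC.
BOOTSTRAP_JSON = """
{
  "certificate_providers": {
    "xyz": {
      "plugin_name": "file_watcher",
      "config": {
        "certificate_file": "/path/to/certificates.pem",
        "private_key_file": "/path/to/private_key.pem",
        "ca_certificate_file": "/path/to/ca_certificates.pem",
        "refresh_interval": "600s"
      }
    }
  }
}
"""


def resolve_instance(bootstrap: dict, instance_name: str) -> Tuple[str, dict]:
    """Expand an instance name sent by the control plane into (plugin name, config)."""
    providers = bootstrap.get("certificate_providers", {})
    if instance_name not in providers:
        raise KeyError(f"no certificate provider instance named {instance_name!r}")
    entry = providers[instance_name]
    return entry["plugin_name"], entry.get("config", {})


class FileWatcher:
    """Re-reads a set of files and pushes their contents to a consumer on change."""

    def __init__(self, paths: Dict[str, str],
                 on_update: Callable[[Dict[str, bytes]], None]) -> None:
        self._paths = paths            # role (e.g. "certificate") -> file path
        self._on_update = on_update    # consumer, e.g. the TLS handshaker
        self._last_seen: Optional[Dict[str, bytes]] = None

    def poll(self) -> bool:
        """Read all files once; notify the consumer only if something changed."""
        contents = {}
        for role, path in self._paths.items():
            with open(path, "rb") as f:
                contents[role] = f.read()
        if contents == self._last_seen:
            return False
        self._last_seen = contents
        self._on_update(contents)
        return True


plugin_name, plugin_config = resolve_instance(json.loads(BOOTSTRAP_JSON), "xyz")
print(plugin_name, plugin_config["refresh_interval"])
```

In a real deployment something would call `poll()` once per refresh interval, so a rotated certificate on disk reaches the consumer without restarting the application.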
This is a pictorial representation of the certificate provider plug-in framework and how it works. A channel, in the blue box on the left, is configured by an xDS credential. It's the same thing on the server side: a server connection, which is shown in the yellow box, is configured using an xDS credential. The xDS credential sets up a certificate provider and an advanced TLS handshaker. It sets up a pipeline for the certificate provider to provide dynamic certificate updates to the TLS handshaker. So the certificate provider continuously gets periodic certificate updates and gives them to the handshaker. The advanced TLS handshaker is responsible for the channel's or the server connection's TLS handshake based on the certificates provided by the certificate provider.

Now, as you have seen, we depended on external components to make this whole thing work. So what exactly is the value add of the gRPC library? What's the functionality added by gRPC? Within the gRPC library, we have a new programming API. We have a new xDS implementation of what is called a transport socket config. In gRPC, we also implemented the certificate provider plug-in framework and how it is indirectly referenced via xDS. In gRPC, we also extended the bootstrap file schema to add fields required by the certificate provider plug-in framework. And, of course, as part of that, we also implemented the file watcher certificate provider plug-in, which supports dynamic updates of file-based certificates and keys.

Now, let me talk briefly about SPIFFE. I had mentioned SPIFFE earlier for service identities. Now, this is not really a gRPC thing, but this is a new spec or standard for identities in a service mesh. So a SPIFFE service identity is assigned to a microservice and encoded in the service's certificate. The service uses the certificate both on its outbound side, which is the client side, and on the inbound, or server, side.
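A SPIFFE ID is a URI of the form spiffe://trust-domain/path, carried in the certificate as a URI subject alternative name. Here is a minimal Python sketch of splitting such an ID and checking a peer's identity against a list of accepted identities. The function names and the example ID are hypothetical, and the check is exact-match only, which is a simplification of what a real mesh does.

```python
from typing import Iterable, NamedTuple
from urllib.parse import urlsplit


class SpiffeId(NamedTuple):
    trust_domain: str
    path: str


def parse_spiffe_id(uri: str) -> SpiffeId:
    """Split a SPIFFE ID into its trust domain and workload path."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    if parts.query or parts.fragment:
        raise ValueError("a SPIFFE ID must not carry a query or fragment")
    return SpiffeId(trust_domain=parts.netloc, path=parts.path)


def peer_is_accepted(peer_uri_sans: Iterable[str], accepted_ids: Iterable[str]) -> bool:
    """True if any URI SAN in the peer certificate is in the accepted list."""
    accepted = set(accepted_ids)
    return any(san in accepted for san in peer_uri_sans)


# Hypothetical identity, made up for illustration.
example = "spiffe://example-project.svc.id.goog/ns/payments/sa/payment-server"
sid = parse_spiffe_id(example)
print(sid.trust_domain)
print(sid.path)
print(peer_is_accepted([example], [example]))
```

The point of the second function is that the decision is made on the identity in the certificate rather than on the host name the client dialed.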
A sample SPIFFE implementation in Google Cloud encodes a trust domain and an identity within the trust domain. The trust domain is an identifier for the trust infrastructure, and in Google Cloud, it is typically the Google Cloud project ID, because the trust infrastructure is associated with the Google Cloud project. The remaining part of the SPIFFE identity is a unique identity for the service, and in our implementation with GKE, it typically is the Kubernetes namespace name and the Kubernetes service account for that particular pod. SPIFFE identities enable server authorization, and server authorization replaces the traditional host name check that happens in HTTPS. SPIFFE identities also enable client authorization, or RBAC. Client authorization is not really part of PSM security. It's another feature that is coming soon.

So how does one use this stuff, for example, in Java? The example in this slide is from the gRFC. You can look it up and see this example there. The usage in C++, Python and Go is similar. There is an xDS channel credential that you supply to your channel builder, and this credential tells gRPC to use the xDS-supplied security configuration. There is a similar server credential on the server side that instructs gRPC to use the xDS-supplied security configuration for the server. Now, you may wonder why the insecure channel credentials are used inside the create method here. This is a fallback credential. I'll talk about that in a bit.

Now, the xDS channel credential is a way for a caller to opt in to the use of the xDS security configuration. Note that a caller can use a different credential with a channel, for example a TLS credential, in which case the xDS-supplied security configuration is ignored, even if the xDS configuration is used for routing, load balancing, et cetera.
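The slide's example is in Java; since the usage in the other languages is similar, here is a rough Python equivalent as a sketch. It assumes the `grpcio` package with xDS support is installed and that an xDS bootstrap file is configured for the process. The xDS credential functions used here are marked experimental in gRPC Python, the helper function names are made up for this example, and the target string and port are the ones used in the talk.

```python
from concurrent import futures


def make_xds_channel(target: str = "xds:///payment.service"):
    """Create a channel whose transport security comes from the xDS control plane."""
    # Imported lazily so this module can be loaded where grpcio is not installed.
    import grpc
    import grpc.experimental

    # Fallback credential: used only if xDS supplies no security configuration.
    fallback = grpc.experimental.insecure_channel_credentials()
    credentials = grpc.xds_channel_credentials(fallback)
    return grpc.secure_channel(target, credentials)


def make_xds_server(port: int = 8000):
    """Create an xDS-enabled server that takes its TLS settings from xDS."""
    import grpc

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4), xds=True)
    fallback = grpc.insecure_server_credentials()
    credentials = grpc.xds_server_credentials(fallback)
    server.add_secure_port(f"0.0.0.0:{port}", credentials)
    return server
```

Swapping the insecure fallback for a TLS credential would mean the connection uses TLS rather than plaintext when the control plane sends no security configuration.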
Now, in this example, because of the xds: scheme used in the target string, xds:///payment.service, the xDS name resolver and eventually the xDS load balancing and routing config is used, but the security configuration from xDS is not used.

Something about the notion of a fallback credential. An xDS credential also takes something called a fallback credential, which kicks in if xDS doesn't supply a security configuration. So instead of choosing to treat this as plaintext or insecure communication, a caller can tell gRPC to use a fallback TLS credential, as I have shown here.

So where can you deploy and test this stuff? Like I mentioned earlier, you can get a test account on Google Cloud and use the Traffic Director user guide, with the section that talks about Proxyless Service Mesh security, and try this out. The main elements of this offering are Traffic Director, the xDS control plane; something called Google CAS, which is Certificate Authority Service; and then, of course, GKE, which is the compute infrastructure you have to use. The user guide will help you understand the flow, and there are setup instructions you can follow.

So let's look at the previous diagram, which I have modified to show the Google Cloud implementation using Traffic Director. We have the gRPC client and the server. Traffic Director is the xDS control plane supplying xDS configurations to the client and the server. I have the file watcher plug-in inside the gRPC certificate provider framework, which makes use of something called GKE mesh certificates. GKE mesh certificates use GKE workload identities and use SPIFFE encoding of those identities. And GKE integrates with Google CAS to mint certificates and to make them available to the gRPC workloads.

Now, the four bullet points at the bottom right describe the layers bottom up. So at the bottom, I have the CAS certificate infrastructure to mint certificates.
Then I have the layer above that, which is the GKE-to-CAS integration. Above that, I have GKE supplying the certificates to the pods where the gRPC workloads are running. And at the top, I have Traffic Director as the service mesh control plane, which is orchestrating the whole thing. So that was about PSM security.

So I'll talk about some roadmap items in this area. I did talk about something called xDS authorization, or RBAC, which is described in gRFC A41. I have provided a link here. And there is something new called the SPIFFE federated trust bundle, for proper federation as per the SPIFFE spec. In this, there is a map from trust domain to root certificate, so that you don't use the same root certificate to validate all the trust domains. Most probably, we'll use something called a configurable certificate validator, which is an xDS extension point, to support this federation of trust domains. One more thing that could be coming is more certificate provider plug-ins, based on user demand. There's one more interesting thing, which is something called a handshaker service or secure service agent, where the TLS handshake is offloaded to the agent. The advantage of that is that the certificates and the private key are never shared with the user space application. So this prevents exfiltration of certificates and keys, which are minted by the certification authorities. Another potential development is Envoy adopting the certificate provider plug-in architecture. This will facilitate deployment of interoperable Envoy and gRPC workloads in the same service mesh.

So this is my last slide. Basically, it has links to various resources to get more information. The five gRFCs: PSM security, Java, channel and server credentials, and, you know, stuff like authz that I talked about. Then a couple of links to the Google Cloud blog for an intro to this feature in the context of Google Cloud.
And of course, there is the Traffic Director user guide for security with Proxyless Service Mesh using gRPC. So that was it. Questions?

So we've got just a couple of minutes for questions here. Maybe I'll throw in a couple from online.

Okay, sure.

I think maybe for some of the folks that are newer to gRPC, there are some questions about the effort here: is it something that can be abstracted to other environments, or is it just specifically for gRPC? Any support for HTTP/2, things like that, on the mesh side of things?

I mean, as far as I can see, it is specifically for gRPC in terms of abstracting it out or reusing it for other frameworks. You could probably use the ideas, like using xDS to orchestrate stuff like that, but the actual design and the implementation we have done is very gRPC-specific.

Yeah, it's really about how gRPC is driving xDS and those components.

Right, right, and specifically, if you look at the gRPC channel and server architecture, it's very tightly integrated with that.

Great, thank you. Anyone in the room with a question? Yeah, okay.

Thanks. Is this compatible with, like, if you build an application with gRPC and it's not deployed in a mesh, can this connect to the mesh without having to deploy a proxy then?

So when you say it's not deployed in a mesh, you are saying there is no control plane?

Yeah.

Yeah, I can't think of a way to make that work, because gRPC needs the control plane to orchestrate many of these things.

Yes, yeah. Great, so maybe one more question?

This is kind of similar, but will gRPC proxyless work on a mesh that's not Google Traffic Director, say if I'm on Google traffic route mesh?

Yeah, yeah, it does support that, yes. I mean, I talked about the Google Cloud implementation because that is something that has been tested and works and all.
Otherwise, everything we have done conforms to the xDS protocol without any Google-specific extensions or Google-specific implementation. So, yeah. You can take a control plane like Istio and try this out.

Fantastic. Sanjay, thank you very much.

Okay, thank you.