Hello, I'm Mark Roth, and I'm here to talk about XDS support in gRPC. Here's what we're going to cover today. We'll talk about why we're doing this, and about the changes we needed in the XDS APIs to support non-proxy clients like gRPC. We'll talk about some gotchas we discovered in the XDS transport protocol, and about what management server maintainers need to do in order to support gRPC alongside Envoy. Then we'll cover our current status and roadmap.

Let's dive in and talk about why. As you know, Envoy is commonly used as a sidecar proxy in service mesh deployments. There are two main advantages that Envoy brings here: it provides a whole bunch of really nice traffic management features, and it provides centralized management via XDS.

Here's a diagram of a typical service mesh topology. Every node in the mesh has an Envoy instance running as a sidecar right next to the application. The red lines here are data plane communication: Envoy intercepts all traffic in and out of the application and manages communication between the nodes. The blue lines are control plane communication: the Envoy instances get their configs from the XDS server.

So what's the problem with this approach? Why support XDS in gRPC? Well, the proxies add overhead. My team ran a gRPC benchmark comparing the proxied and proxyless approaches, which showed about a three-times improvement in queries per CPU-second. So there's a significant CPU overhead to running the proxies. There's also a latency cost, although I don't have specific data of our own on that. Istio does have data showing that request latency is significantly higher with the proxies, although their benchmark is not exactly the same scenario as the one we ran in gRPC, so it's a bit of an apples-to-oranges comparison. It's the best data I have to share, but certainly I don't think anybody would doubt that those two extra hops add latency to all the communication.

Now, for a lot of applications it makes complete sense to pay the cost of the proxies, because it avoids having to implement the traffic management features and control plane integration in the application. This is especially true when a large number of applications are all going through Envoy: you'd otherwise have to pay the implementation cost separately for each of those applications. However, by supporting these features in gRPC, applications that use gRPC can get them for free, without the cost of the proxies. That's really the motivation here.

I should note that the motivation is not to put Envoy out of business; we're not trying to replace Envoy. There will still be a whole bunch of non-gRPC applications that need Envoy, and frankly, there will still be a number of use cases where it makes sense even for gRPC traffic to continue to go through Envoy. The goal here is really for Envoy and gRPC to be able to coexist in the same service mesh.

So here's that same service mesh topology diagram as before, but this time using proxyless gRPC.
Instead of having a sidecar proxy on each node, the application just links against gRPC, and gRPC directly provides the traffic management features and the control plane integration by taking in its configuration from XDS. So there's no need for the sidecar proxy.

All right, let's talk about what changes we needed to make in the XDS APIs to make this work. The XDS APIs were developed specifically for Envoy, so they make the obvious assumption that the client is Envoy. That caused some challenges for us, because not only is gRPC not Envoy, it's not even a proxy: it's a client and a server. Now, we've generally tried to minimize the places where we needed to change the API to support gRPC, for two reasons. First, we want to make it really easy for existing Envoy service meshes to make use of gRPC without having to duplicate all of their XDS configuration resources. Second, we want to allow gRPC and Envoy to easily coexist in the same service mesh, as I mentioned earlier. The more we can leverage the existing infrastructure rather than creating new stuff, the better. But there are two places where Envoy and gRPC are sufficiently different that we did have to make changes, and we'll talk about each of those here.

The first thing we changed was the way the gRPC client handles listener resources. As you probably know, the main XDS APIs are LDS, RDS, CDS, and EDS. They're generally chained together in that order, describing the path that a request takes as it moves through Envoy. It all starts with a listener resource from LDS, which tells Envoy what port to listen on. And that concept simply doesn't apply to the gRPC client: the gRPC client doesn't listen on ports; it's told by the application to send requests to particular virtual hosts.

When we were first looking at the XDS API, trying to understand it and figure out how to make it work for gRPC, our first attempt at a mental model was to figure out which parts of the API apply to the server and which parts apply to the client, and use each in the appropriate place. We thought, well, maybe LDS and RDS are really about configuring incoming requests, so we'd use those on the gRPC server, and CDS and EDS configure outgoing requests, so we'd use those on the gRPC client. It turns out this doesn't actually work: most of the configuration that's really interesting on the gRPC client — things like traffic splitting, retries, and timeouts — actually lives in LDS and RDS. So we talked with the Envoy folks about the right way to model this, and Matt Klein suggested a new approach called an API listener.

So what is an API listener? The idea is to generalize the concept of a listener such that there are different types of listeners, and each type configures a different type of interface for the application. You'd have two main classes of listeners. You'd have your existing socket listeners, like the TCP and UDP listeners, which basically configure a TCP or UDP socket for the application to use. Or you'd have API listeners, which configure some sort of API for the application to use.
The one type of API listener that we've added is an HTTP API listener, which configures an API with HTTP-like semantics. That includes the gRPC API, which is basically a subset of HTTP functionality, because the gRPC protocol is built on HTTP/2. It should be noted that the HTTP API listener is not gRPC-specific: it's also intended to be used by Envoy Mobile in the future, and there could be other XDS-aware HTTP clients that use it as well. You could also have other types of API listeners in the future. For example, it's easy to imagine an SQL API listener that would configure some sort of SQL-based API for an XDS-aware SQL client.

Now, the configuration of an HTTP API listener is a subset of the socket listener configuration. A socket listener primarily contains a list of L3/L4 filters, where the HTTP connection manager filter is the last filter in the chain, and that's where all the L7 functionality is configured. In the case of an HTTP API listener, we don't need any of the L3/L4 functionality, just the L7 stuff. So an HTTP API listener's configuration is essentially just the configuration for the HTTP connection manager filter.

The second change we made for gRPC in XDS is related to the special way in which Envoy uses LDS and CDS. Envoy uses those two APIs differently than it uses the other XDS APIs, and the special behavior doesn't really make sense for gRPC. The general XDS flow for Envoy is that when it starts up, it immediately sends both an LDS and a CDS query in parallel. These basically act as the two roots of Envoy's configuration; all of the other resources are chained off of the listener and cluster resources. For example, when Envoy receives the LDS response, that tells it what RDS resources it needs, so it goes and requests those resources by name. Similarly, when it receives the CDS response, that tells it what EDS resources it needs, and it requests those by name. All of the other ancillary XDS APIs work the same way as RDS and EDS — for example, VHDS is chained off of RDS, and again, Envoy asks for the specific resources it needs by name. So the only two APIs that are really special here are LDS and CDS; the rest are all quite uniform.

LDS and CDS are special in two main ways. Number one, instead of asking for the specific resources it needs by name, Envoy makes what we call a wildcard query. This is represented as a request where the resource names list is empty. In other words, the client is not telling the server "here are the specific resources I need"; it's saying "I don't know what I need — you tell me," and the server is expected to magically figure out what resources are appropriate to send, typically based on the client's node identity. It then sends a whole bunch of resources to the client, and whatever it sends is what the client will use. The other way in which LDS and CDS are special is that Envoy automatically makes both of those calls at startup. This makes sense for Envoy: as a proxy, not something built into the application, Envoy can't have any way of knowing a priori what ports to listen on or what upstream clusters it may need to talk to, so it needs to know about everything it might ever need. Making these queries up front makes sense for Envoy.
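To make the wildcard-versus-named distinction concrete, here's a minimal sketch in Go. The message shapes come from envoy.service.discovery.v3; the go-control-plane import paths and the node identity are assumptions about the version and setup you'd build against.

```go
// Sketch: an Envoy-style wildcard LDS request versus a request for a
// specific named resource, using the go-control-plane protos.
package main

import (
	"fmt"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

const listenerTypeURL = "type.googleapis.com/envoy.config.listener.v3.Listener"

func main() {
	node := &corev3.Node{Id: "example-node"} // hypothetical node identity

	// Envoy-style wildcard query: ResourceNames is left empty, so the
	// server must decide what to send based on the node identity.
	wildcard := &discoveryv3.DiscoveryRequest{
		Node:    node,
		TypeUrl: listenerTypeURL,
	}

	// gRPC-style query: ask by name for exactly the listener resource
	// matching the target the application used to create the channel.
	named := &discoveryv3.DiscoveryRequest{
		Node:          node,
		TypeUrl:       listenerTypeURL,
		ResourceNames: []string{"server.example.com"},
	}

	fmt.Println(wildcard.GetResourceNames(), named.GetResourceNames())
}
```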
But that flow doesn't really make sense for gRPC, right? gRPC is built into the application, and when the application creates a gRPC client channel, it creates the channel to a particular virtual host. So the application knows exactly what virtual host it's going to be talking to, and it knows it's not going to need anything other than that. We don't really want to send the client info for virtual hosts that it's never going to use.

So in gRPC, the XDS API flow is a little more linear. When the application creates a channel to a virtual host, in this case server.example.com, that triggers gRPC to make an LDS request for a resource of the same name. This resource will be an HTTP API listener, and it acts as basically the single root of gRPC's configuration; everything else is chained off of there. The LDS resource tells us what RDS resources we need, the RDS resource tells us what CDS resources we need, and the CDS resources tell us what EDS resources we need. So in effect, gRPC uses LDS and CDS exactly the same way as the other XDS APIs. It doesn't do the special stuff that Envoy does: we request the specific resources we want instead of making wildcard requests, and we don't proactively make LDS and CDS requests at startup. The LDS requests happen when the client creates a channel, and the CDS requests happen when we get an RDS response.

All right, so we've talked about what changes we made to the API to support non-proxy clients. Let's talk about a couple of other gotchas that we ran into related to the transport protocol. The simplest possible XDS server implementation is to say: when the client makes a request for a given resource type, I'm just going to send it every resource I know of that type, whether or not it needs them. If you do that, you can actually get away with it, and it should work pretty much fine with gRPC. But it turns out that for resource types other than LDS and CDS, the transport protocol is actually designed such that the server only has to send resources when they've changed. So if the client is subscribed to four different RDS resources and only one of them has changed, the server only has to send the one that changed; it doesn't have to resend the other three.

Now, in gRPC, we try to make things easy so that we can work with these very simple XDS servers. Even though we're requesting the specific resources we need, we're going to ignore any resource we receive that wasn't one of the ones we asked for, which is fine; that works pretty well. But there are a couple of implications of this that you need to be aware of if your management server is trying to be more intelligent and only send things that have changed. Number one, the client doesn't validate the resources that it's going to ignore, so even if they're invalid, gRPC is not going to NACK them. And number two, gRPC is not going to cache these resources, which means that if a resource is requested later, the server must resend it even if it thinks the client has already seen it. To take an example: let's say that initially the client is only requesting resource A, but the server decides to send it resources A and B. Later on, if the client requests A and B, the server needs to resend B instead of assuming the client already has it, even if B hasn't changed.
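Here's a rough sketch of what that bookkeeping might look like on the server side. The streamState type and resourcesToSend helper are hypothetical scaffolding for illustration, not part of any real management server API.

```go
// streamState tracks, for one resource type on one stream, which names
// the client is subscribed to and which have already been pushed.
// Hypothetical scaffolding, not a real management server API.
type streamState struct {
	subscribed map[string]bool // names in the client's latest request
	sent       map[string]bool // names already pushed on this stream
}

// resourcesToSend decides what must go in the next response: any name
// that is newly subscribed (send it even if unchanged, since the client
// neither validated nor cached it while unsubscribed), plus any
// subscribed name whose content has changed.
func resourcesToSend(st *streamState, requested []string, changed map[string]bool) []string {
	newSubs := make(map[string]bool, len(requested))
	for _, name := range requested {
		newSubs[name] = true
	}
	var out []string
	for name := range newSubs {
		if !st.subscribed[name] || !st.sent[name] || changed[name] {
			out = append(out, name)
			st.sent[name] = true
		}
	}
	st.subscribed = newSubs // dropped names are implicitly unsubscribed
	return out
}
```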
So the server has to be aware of what the client has actually subscribed to and make sure it sends what the client needs when it requests it. This isn't much of an issue for LDS and CDS, since those actually require that all resources be present in every response anyway. But for RDS, EDS, and all the other APIs, because the server can send only those resources that have changed, this is something you have to watch out for.

Unsubscription is also kind of interesting. When the client is no longer interested in a resource, it can unsubscribe. It does this by sending a new request with the resource name no longer present in the request. For example, if the client previously sent a request for both A and B, and now it wants to unsubscribe from B, it'll send a new request that includes only A. Now, there's an interesting edge case here: what if the client has subscribed to only one resource of a particular type and it wants to unsubscribe? Then it's going to send a request where the resource names list is empty — which is exactly the same way the protocol indicates the wildcard queries that Envoy uses for LDS and CDS. The answer is that a request with no resource names is interpreted as a wildcard query only on the first request for that resource type on a stream. In other words, the decision to use a wildcard query can only be made at the beginning of a stream. If a client has previously sent a request with a non-empty resource names list, a subsequent request with an empty list means "unsubscribe from all"; it does not mean wildcard. Management servers need to be careful to get this right.

There are a lot more gotchas in the transport protocol that we ran into as we were trying to understand it — quite a few edge cases and interesting behaviors that are emergent properties of things that were put into the protocol. I'm not going to cover them all here; I don't have time to get into all of them, unfortunately. But as we ran into these, we worked with the Envoy community to update the XDS documentation to explain how these various edge cases are supposed to work and how clients and servers are supposed to react when they encounter them. So for anybody writing your own XDS server — or your own client, for that matter — I strongly encourage you to read the XDS spec. There's a lot of good information in there about how to handle these edge cases, and if anything's unclear, you can ask about it and we can clarify the documentation even further.

All right, let's talk about what a management server needs to do to support gRPC. Imagine you have an existing configuration for use with Envoy: a TCP listener resource pointing to a route configuration, and that route configuration defines three virtual hosts. Let's say you've got gRPC traffic that's currently going through Envoy to those virtual hosts, and now you want to switch to proxyless gRPC. How do you do it? You need to add an HTTP API listener for each of the virtual hosts. As mentioned earlier, the HTTP API listeners don't have L3/L4 filters in their configuration; it's just the HTTP connection manager. But the configuration in the HTTP API listeners can point to the existing RDS resource, which can be shared between gRPC and Envoy — you don't have to duplicate that in most cases. (There's a sketch of such a listener below.)
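Here's a sketch of building such an HTTP API listener with go-control-plane: no L3/L4 filter chain, just the HTTP connection manager config, pointed at the same RDS resource Envoy already uses. The resource names, the ADS config source, and the exact import paths are assumptions you'd adapt to your own setup.

```go
package main

import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	listenerv3 "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
	hcmv3 "github.com/envoyproxy/go-control-plane/envoy/extensions/filters/network/http_connection_manager/v3"
	"google.golang.org/protobuf/types/known/anypb"
)

// apiListener builds a listener named after the virtual host (e.g.
// "server.example.com") whose configuration is just an
// HttpConnectionManager fetching routes over ADS from the existing
// shared route configuration.
func apiListener(vhost, routeConfigName string) (*listenerv3.Listener, error) {
	hcm := &hcmv3.HttpConnectionManager{
		StatPrefix: "api_listener",
		RouteSpecifier: &hcmv3.HttpConnectionManager_Rds{
			Rds: &hcmv3.Rds{
				RouteConfigName: routeConfigName,
				ConfigSource: &corev3.ConfigSource{
					ConfigSourceSpecifier: &corev3.ConfigSource_Ads{
						Ads: &corev3.AggregatedConfigSource{},
					},
				},
			},
		},
	}
	hcmAny, err := anypb.New(hcm)
	if err != nil {
		return nil, err
	}
	return &listenerv3.Listener{
		Name:        vhost,
		ApiListener: &listenerv3.ApiListener{ApiListener: hcmAny},
	}, nil
}
```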
Once you have the new resources set up, all you need to do on the client side is create a bootstrap file that tells the client where the XDS server is, and then tell gRPC to use XDS by using an xds: URI when you create the channel. And then it just works — it's magic. (There's a sketch of the client side below.)

Now, gRPC does validate and cache the entire route configuration. It can't ignore virtual hosts other than the one that caused it to look at the route configuration, because the application might later create a channel to another virtual host that refers to the same route configuration. In the example on the last slide, if the application initially opens a channel only for service one, it'll get the route configuration that has all three virtual hosts in it. If it later opens a channel to service two, it now has a new listener resource that points to that same route configuration, which already includes the information for service two. So we have to cache and validate the whole thing up front so that we've already got it if it's going to be used later.

It does take a fair amount of memory in gRPC to store all this information. So if you're in a situation where the route configuration includes a really large number of virtual hosts but gRPC is only ever going to use a small subset, one way to avoid giving gRPC all this information it doesn't need is to inline the relevant part of the route configuration into the HTTP API listeners. That way, each listener only carries the information for the one virtual host it cares about, and everything just works.

You might also run into a case, due to a bug or a mismatch in behavior, where gRPC rejects a config that Envoy would be okay with. Let's say you have a route configuration with one virtual host that's used by gRPC and another that's used only by Envoy, and gRPC is rejecting the config because of the virtual host that's meant for Envoy, even though gRPC will never use it. If that happens, moving the relevant part of the route configuration into the HTTP API listeners is also a decent workaround. We hope you shouldn't need to do this — I only know of one case today where it's a problem, which I'll mention later, and it's something we're going to be fixing. So if you run into this, it's a possible workaround, but also let us know, because we want to avoid the need to do this at all. It's just something to keep in your back pocket in case you need it.

You do, of course, need to support non-wildcard LDS and CDS requests, which Envoy will never make but gRPC does. This will mostly be a no-op for CDS, because even if you're doing a pure state-of-the-world response where you always send all the resources, that should work just fine. For LDS, the one thing you need to be aware of is to make sure not to send the API listener resources to Envoy when it makes its wildcard requests, because it won't know what to do with them.

All right, finally, let's talk about our current status and roadmap. For those who may not know, there are actually four variants of the XDS transport protocol. There are two different dimensions — aggregated versus non-aggregated, and state-of-the-world versus incremental — and every combination of those yields four variants. gRPC currently supports only the aggregated state-of-the-world variant.
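Going back to the client-side setup: here's a minimal sketch with grpc-go (the other languages are analogous). The blank import registers xds: URI support; the bootstrap file is located via the GRPC_XDS_BOOTSTRAP environment variable. The server URI and bootstrap contents are illustrative assumptions, not real endpoints.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // enables the xds resolver and balancers
)

// The bootstrap file pointed to by GRPC_XDS_BOOTSTRAP looks roughly like:
//
//	{
//	  "xds_servers": [{
//	    "server_uri": "xds-server.example.com:443",
//	    "channel_creds": [{"type": "google_default"}]
//	  }],
//	  "node": {"id": "example-node"}
//	}

func main() {
	// The xds scheme triggers the LDS request for "server.example.com".
	conn, err := grpc.Dial("xds:///server.example.com",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to create channel: %v", err)
	}
	defer conn.Close()
	// ... create stubs on conn and issue RPCs as usual ...
}
```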
On the protocol variants: we will likely support the incremental aggregated variant in the future, but we don't have any plans right now to support the non-aggregated variants. We support XDS in the gRPC client only, not yet in the gRPC server, although as you'll see on one of the next slides, that is in the works. You can see the list of features that we currently support here, with a few noted limitations. The one I want to call out is the case_sensitive field in route matching. As I mentioned earlier, this is the one case where gRPC will currently reject a config: if case_sensitive is set to false. We are going to fix that so that you'll be able to use case_sensitive false if you want, and we will not reject configurations that have it. We plan to fix that in the near future.

Here is a list of features that are on our roadmap. We've already started implementing timeout and circuit-breaking support, and a bunch of others shown here are in our pipeline. You can watch our proposal repo to get notifications when we post gRFCs, which are gRPC's design proposals. The one thing we know is coming in the not-too-distant future that we don't yet have an answer for — but are going to be thinking hard about — is how to support HTTP filters in gRPC. The reason this is challenging is that gRPC already has a plug-in mechanism called interceptors, and gRPC interceptors have some architectural differences from Envoy's HTTP filters, both in terms of where they sit in the stack and in terms of their capabilities. So we're going to have to figure out how to bridge that gap in the API. That's something we know we'll need to do both for fault injection and for observability.

I would encourage you to file bugs in our repo if there are features you need that either we haven't done yet or aren't on our roadmap. I can't promise we'll prioritize whatever you need, but it's very useful for us to know what features users want; that helps guide our development. And of course, if you want to contribute to the implementation, we can help guide you in doing that as well.

gRPC has three main implementations. We have native implementations in both Java and Go, and there is a C-core implementation which is used as the underlying implementation for all the languages you see here. I should note that in C# there are actually two different implementations: the one we provide, which wraps C-core, and a separate native C# implementation for .NET written by Microsoft. That implementation, as far as I know, does not yet have XDS support. gRPC also has a native Node.js implementation that has just recently started adding XDS features. It's lagging a little behind the other languages in timeline, but they're working on catching up.

There are a few control planes that work with proxyless gRPC. Google's Traffic Director is the main one that we develop and test against; the link here goes to a blog post that talks about the Traffic Director product, how it works, and what it supports. Istio also has some basic experimental support for proxyless gRPC. They don't have any documentation for it yet, but they do have a test that shows it works. So I would encourage people to file bugs against Istio if this is something you actually want to use in production.
The more bugs you file expressing your interest, the more that will help them prioritize it, no doubt. And there is a bug open requesting gRPC support in go-control-plane as well.

I want to leave you with one final thought. Until now, there was only one XDS client, which was Envoy. Now, with gRPC, there are two — but I think there are going to be more in the future. I don't think this is the last one. I've already heard of a couple of cases where people have been talking about building an XDS-aware HTTP client library, and as I mentioned earlier, there will likely be cases where people want XDS-aware SQL clients. I suspect there will be other use cases in the future that we haven't thought of yet. So I think this is a really exciting time for the XDS ecosystem: it's becoming richer and more varied, and I'm really excited to see where it goes. With that, I will take questions. Thank you.