Hi, everyone. Thanks for joining our chat today about xds-relay. It's a new project in the Envoy ecosystem, and we are very excited to introduce it. We're hoping to tell the story of how it started, the problems it's trying to solve, and to answer some of the open questions. I'm Jyoti, and I'm here with Jess today. We both work on the networking team at Lyft. We are the maintainers of the xds-relay project, as well as new maintainers on go-control-plane. Our contact information is on the slide, and we'd be very happy to connect and engage further after the talk.

Everything has a history, and we are no different. The networking team at Lyft has been maintaining the Lyft edge and service mesh for numerous years. Lyft's compute architecture has evolved from VM-based ASGs to Kubernetes. Initial adopters of Kubernetes slowly trickled into the mesh as the platform became more robust and adoption became more widespread. In order to maintain one way of doing things, leadership also mandated moving all services to Kubernetes by the end of 2020. Lyft's service discovery, based on the open source discovery service, was accompanied by a Kubernetes pod informer model. The Envoy control planes slowly evolved to accommodate both mechanisms. There were talks at KubeCon San Diego 2019 and KubeCon Amsterdam 2020 that describe the service discovery architecture in more detail. By the end of 2019, 25% of services had been migrated to Kubernetes and were running production workloads. This was enough scale for ugly incidents to occur, and at the center of it all was Lyft's homegrown control plane. By December 2019, it was clear that the current architecture wouldn't scale if more services kept moving to Kubernetes. We made point-in-time fixes to keep the system running, and at the same time came up with a new approach to managing service discovery. It's called xds-relay.

Let me briefly describe the before and after model of how service discovery works at Lyft. We have a homegrown control plane based on the go-control-plane library. It subscribes to endpoint updates from the VM-based discovery service and the Kubernetes API server. It also subscribes to cluster updates from relatively low-flux S3 files. The control plane is connected to the Envoy sidecars via gRPC. These sidecars could be on legacy VMs or containers.

So far so good. One striking difference between the legacy and Kubernetes stacks was that VMs take minutes to spin up, while pods take a few seconds to be ready. This gives services an incentive to keep their instance count low and aggressively scale up when traffic spikes. This pattern causes most services to scale up at peak morning and evening commute hours and then quickly scale down again. We, like most control planes, use Envoy's state-of-the-world xDS. This means that if a service A has N endpoints and a dependent service B has M endpoints, a new pod coming up in service B causes the full endpoint set to be re-pushed to every one of service A's sidecars. Based on how much each service scales, this quickly becomes an N-by-M processing problem and causes too many updates across the hundreds of services in the mesh. These operations come at the cost of elevated CPU. We frequently encountered long periods of elevated CPU on the control plane instances, and this caused some bad outages. Next, such a huge influx of new pods causes all of those pods to create new connections to the control plane. At scale, connection management becomes a bottleneck.
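To put rough numbers on that state-of-the-world fan-out, here is a small back-of-the-envelope sketch in Go; the service sizes and the scale-up burst are invented purely for illustration and are not Lyft's real figures.

```go
package main

import "fmt"

// Rough model of state-of-the-world EDS fan-out during a scale-up burst.
// All counts below are hypothetical, purely for illustration.
func main() {
	const (
		endpointsInService = 500 // N: endpoints already in the watched cluster
		subscribers        = 200 // M: sidecars subscribed to that cluster
		podsAddedInBurst   = 50  // new pods coming up during a commute-hour spike
	)

	// With state-of-the-world semantics, every membership change resends the
	// entire endpoint set to every subscriber of that cluster.
	pushes := podsAddedInBurst * subscribers
	recordsSerialized := pushes * endpointsInService

	fmt.Printf("pushes sent: %d\n", pushes)
	fmt.Printf("endpoint records serialized: %d\n", recordsSerialized)
}
```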
Each connection has a cost, which adds up quickly and causes CPU pressure. Since the control plane was VM based, it could not scale proportionally. The interesting part starts next. It is the control plane's responsibility to understand the update, pack the information into gRPC objects, and send them across to the different sidecars. This requires the gRPC payload to be serialized into network bytes and sent. Serialization is a CPU intensive operation, and we noticed that as more and more services adopted Kubernetes, the control plane's CPU went through the roof whenever everything scaled up and down. The scale-down characteristics are very similar to scale-up and suffer from the same symptoms.

The next question is: why is high CPU bad? High CPU usage causes throttling and slows down the system. Payloads cannot be serialized and sent fast enough to the sidecars, while more payloads keep getting queued. Endpoint discovery at Lyft does not have durable storage, so any missed updates mean membership information diverges and becomes stale. Stale membership can cause incorrect routing and panic routing in Envoy.

While all these problems were happening, we sprang into action. One of the first approaches was to move the control plane infrastructure to Kubernetes so that it could scale quickly and proportionally with other services. We could also scale up or pre-scale the existing legacy control plane before known events. Control plane pods were deployed in the Kubernetes clusters so they could scale proportionally. We changed the Go channels in the go-control-plane library from unbuffered to buffered, with buffered channels of length one. For state-of-the-world EDS, the last update wins: if one of the sidecars' network was being flow controlled and the control plane's channel was blocked from making progress, we could keep overwriting the latest update in the buffered channel and still maintain correctness (there's a short sketch of this pattern below). We performed flame graph analysis and fixed a few wasteful serialization loops, both in the go-control-plane library and in our own internal control plane. Instantaneous endpoint updates were cost-prohibitive, so we made sure to rate limit and batch Kubernetes endpoint updates over a sizable interval, say, tens of seconds. Envoy is eventually consistent, so this worked out fine for now, although it slowed down membership convergence and made the control plane's response time slower.

All of this led us to think about a different approach to service discovery, and we are calling it xds-relay. I'll let Jess talk about it.

Thanks, Jyoti. So xds-relay is a project that we started early in the year to address some of the gaps we just mentioned, and from its conception, we had a few goals in mind. First, we wanted it to be built entirely in the open. We spoke with a few companies in the past months operating at a similar or larger scale than Lyft, and there's always this recurring theme and question: what is the standard for a control plane implementation, and will Lyft be open sourcing its control plane? Control plane development is difficult, and there are a lot of intricacies to get it operating correctly for a particular company's infrastructure. One of the goals with xds-relay is to abstract away the layers of Lyft's control plane that we believe are shareable. And secondly, we want this to be an out-of-the-box solution that users can run and operate with minimal knobs.
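The "last update wins" trick on a length-one buffered channel that Jyoti described can be sketched roughly as follows. This is a minimal illustration of the pattern, assuming a single producer per channel; it is not the actual go-control-plane code.

```go
package main

import "fmt"

// sendLatest delivers update into a length-one buffered channel, overwriting
// whatever stale update is still sitting there. With state-of-the-world EDS
// only the newest snapshot matters, so dropping the older one is safe.
// Assumes a single producer per channel; concurrent producers would need a lock.
func sendLatest(ch chan string, update string) {
	select {
	case ch <- update: // channel had room; fast path
	default:
		select {
		case <-ch: // evict the stale update the consumer never picked up
		default:
		}
		ch <- update
	}
}

func main() {
	updates := make(chan string, 1)

	// Simulate a slow, flow-controlled consumer: three updates arrive before anyone reads.
	sendLatest(updates, "endpoints v1")
	sendLatest(updates, "endpoints v2")
	sendLatest(updates, "endpoints v3")

	fmt.Println(<-updates) // prints "endpoints v3": the latest update wins
}
```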
xds-relay is also built on the popular and well-supported open source go-control-plane library. So we see xds-relay as a CDN for xDS. Initially it's an aggregation and caching layer that's meant to reside physically close to xDS clients, so those running in the same region, data center, et cetera. Similar to the general benefits of using a CDN, users of xds-relay can benefit from faster delivery of xDS responses, improved service uptime, and reduced bandwidth costs from caching and the other optimizations we have planned for this project. xds-relay will be configurable with rule-based definitions that specify which groups of xDS requests should get aggregated and cached under the same key. xds-relay will maintain one gRPC stream to the control plane server for each of the unique keys.

Lyft's current control plane manages multiple facets: pre-processing multiple sources of Lyft service metadata in order to generate the xDS responses, pulling from legacy discovery mechanisms and the Kubernetes API server to get endpoint information, and last but not least, caching and fanning out discovery responses. xds-relay will pull out the connection management and caching aspects, allowing the control plane server to scale independently of xDS clients. We're also making a lot of optimizations in the transport layer to keep the connections low latency and low overhead. xds-relay will also have built-in common control plane observability mechanisms, including stats with support for multiple sinks; error, warning, and debug level logs; as well as admin endpoints for viewing the cache and other common control plane usability tools. Alongside upstream server retries, xds-relay will implement mechanisms to stop a thundering herd of requests from xDS clients through queuing and rate limiting. Lastly, we want to build xds-relay in a way that the components operate in a plug-and-play manner. We'll get into a lot of the ambitious goals we have around a general relaying component later, but we understand that a feature-rich set can feel bloated for a company that just wants to run lightweight components. For that reason, we're conscious to make xds-relay as accessible as possible.

At the heart of xds-relay are rule-based definitions for request aggregation and caching. Here's an example where we've decided to cache on service and request type pairings. We use YAML-structured match and result rules to create the unique aggregated keys. We won't dive into the specifics here because we'll be doing a demo shortly.

We thought it would be easiest to understand the architecture of xds-relay by going through a workflow diagram. On this slide, you can see the workflows numbered one through six. When discovery requests first make their way into xds-relay, they go through an aggregator component. This is the component that takes the rules we mentioned on the previous slide and translates the requests into unique discovery keys. In this example, we've chosen to map both requests to the same key using a combination of the node ID, the cluster, and the request type. Once the aggregated key is generated for a request, the request gets added to an in-memory cache with configurable TTL and size limits. If there is already a response in the cache and the versioning is different, xds-relay will immediately return the discovery response to the sender.
However, if there's no response available or the versioning is the same, xds-relay will maintain an open watch for the request by storing it in the cache, and it will wait for a new response from the upstream management server. xds-relay only ever maintains a single gRPC stream per cache key, and this coincides with the first unique request to create a cache entry for the key. The first request is propagated to the upstream management server through our upstream client, and goroutines and channels are used to await discovery responses from the management server. Upon a response from the server, xds-relay fans the response out to all of the xDS clients with an open watch in the cache for the specified aggregated key and then removes those watches from the cache (there's a stripped-down sketch of this below). All of these components are orchestrated by what we internally call the orchestrator.

At Lyft, we run a group of control plane servers on each Kubernetes cluster, with services distributed across the clusters. Our control plane cluster was scaled up to run on multiple nodes in order to support the number of Envoy clients we had at Lyft. With this new architecture, we run a group of xds-relay instances on each Kubernetes cluster, allowing us to scale down our control plane server and refocus the control plane's core logic on pre-processing and generating new xDS responses. We're now also able to scale the response generation logic independently from the connection management portion. In this specific example, we've aggregated the requests for the location service and the requests for the user service into two unique cache keys. xds-relay is responsible for caching and optimized response fan-out, which means the number of connections to the upstream server is also a lot smaller than it would be without a relaying component.

Lyft runs all of its infrastructure in one giant VPC, but we envision other interesting topologies for running xds-relay on hybrid clouds and multi-VPC infrastructure setups. For example, here's a topology where we run a cluster of xds-relay instances on-prem, in each data center, physically close to the Envoy services, while hosting the control plane server in a centralized VPC, for faster service discovery.

Now we're gonna get into a demo. All right, in this demo we'll be running a very simple setup: a management server that just sends snapshots every 10 seconds, an xds-relay server, and two Envoy clients. We'll begin by starting up our management server. There we have it: snapshots are now being generated every 10 seconds. While that's running, let's take a look at some of the configuration files that we'll be using for our Envoy clients and the xds-relay server. We have two bootstrap files here for the Envoy clients, and they're quite simple. One of them has the node ID for the first Envoy client; the other has the node ID for the second. They run on slightly different ports: one runs on port 19000, the other on port 19001. But everything else is the same. Most notable is the cluster: you'll notice that they both share the same staging cluster, which is what we'll be using to define our xds-relay aggregation rules later on. And they both designate the same control plane server, pointing to xds-relay running on port 9991. Let's just quickly confirm the bootstrap file of our second Envoy client. As you can see, it's identical except for the node ID and the port it's running on.
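The watch-and-fan-out behavior described above can be illustrated with a stripped-down sketch. The type and method names here are invented, version comparison is omitted, and this is not xds-relay's real cache implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// watchCache keeps, per aggregated key, the last response seen and the set of
// open watches (one channel per downstream Envoy waiting on that key).
// Simplified illustration only: TTLs, size limits, and version checks are omitted.
type watchCache struct {
	mu       sync.Mutex
	response map[string]string          // aggregated key -> last cached response
	watches  map[string][]chan<- string // aggregated key -> open watches
}

func newWatchCache() *watchCache {
	return &watchCache{
		response: make(map[string]string),
		watches:  make(map[string][]chan<- string),
	}
}

// Fetch returns the cached response immediately if one exists; otherwise it
// registers an open watch that will be satisfied by the next upstream response.
func (c *watchCache) Fetch(key string, w chan<- string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if resp, ok := c.response[key]; ok {
		return resp, true
	}
	c.watches[key] = append(c.watches[key], w)
	return "", false
}

// SetResponse stores a new upstream response, fans it out to every open watch
// for the key, and clears the watches, mirroring the workflow in the talk.
func (c *watchCache) SetResponse(key, resp string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.response[key] = resp
	for _, w := range c.watches[key] {
		w <- resp
	}
	delete(c.watches, key)
}

func main() {
	cache := newWatchCache()
	w1 := make(chan string, 1)
	w2 := make(chan string, 1)

	// Two Envoy clients ask for the same aggregated key before any response exists.
	cache.Fetch("staging_eds", w1)
	cache.Fetch("staging_eds", w2)

	// One upstream response satisfies both open watches.
	cache.SetResponse("staging_eds", "endpoints v1")
	fmt.Println(<-w1, "/", <-w2)
}
```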
So now let's take a look at our xds-relay configuration file. As I mentioned, xds-relay is going to be running on port 9991, and it points to our simple control plane server running on port 18000. There's some other miscellaneous server metadata including the log level, cache sizing, the admin endpoint, et cetera.

Finally, let's take a look at the xds-relay aggregation rules. These rules might look quite intimidating at first, but they're actually pretty easy to understand. They're going to end up generating keys that look like staging_eds or staging_cds, and that works via these fragments. For the first fragment, we look at request types that fall under LDS, CDS, EDS, or RDS, and we apply a regex match-and-replace operation on the node cluster. Because in our Envoy bootstrap files we had the node cluster set to staging, this will always result in staging as the first cache key fragment. The second fragment is just a static constant: if the request is of type listener, we append LDS as the second string fragment; with CDS, we append the static constant CDS, and so forth. And for the very last fragment, if the request is of type route configuration, we append an additional fragment that has the resource name. So again, all this is saying is that we're going to end up with aggregated cache keys in xds-relay that look like staging_cds, staging_eds, et cetera.

All right, so now let's start our xds-relay server using the bootstrap configuration and aggregation rules that we just talked about. Okay. If we were to curl the xds-relay admin endpoint right now, we'd notice that the cache is empty. The reason for this is that we haven't had any Envoy clients hitting the cache yet, so there are no responses or requests being cached. So let's start our two Envoy clients. This is the first client being started, and you can see a bunch of activity happening in our xds-relay server since we have debug level logs turned on. Now we've started the second client; again, some more activity. And now, if we run the same curl on our cache endpoint, we'll see something different. Of note here is that we're using jq to make the output more concise for this demo. As you can see, the response contains the version of the latest snapshot that was generated by the server. In this case it's 95; more seconds have passed since I ran this query, so that's why we're not seeing the very latest. But we can also see that both Envoy clients' requests are cached: Envoy client two and Envoy client one. Since these clients fall under the same aggregation rule, if we were to observe the stats, we'd also see that there's only one gRPC stream being made to our upstream server despite there being two Envoy clients. We can also validate that the Envoy clients have received valid CDS information by querying Envoy's admin endpoint. This is querying our second Envoy client, and indeed, you can see the latest cluster information. If we query the other Envoy client, we should receive the same information. It varies very slightly because we've generated a new snapshot since, but if we quickly run both in sequence, you'll notice that their cluster information is the same. This example is available on our GitHub; if you'd like to try it out for yourself, please let us know your thoughts. Thanks.
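To recap the key scheme from the demo, here is a small, hypothetical Go function that mimics what those aggregation rules produce. The real rules live in xds-relay's YAML keyer configuration; the function name, key casing, and use of v3 type URLs are assumptions made for illustration.

```go
package main

import "fmt"

// aggregatedKey mimics the demo's rules: the node cluster (always "staging" in
// the demo bootstraps) becomes the first fragment, the request type becomes the
// second, and route configuration requests additionally append the resource name.
// Illustrative sketch only, not xds-relay's keyer implementation.
func aggregatedKey(nodeCluster, typeURL, resourceName string) string {
	var kind string
	switch typeURL {
	case "type.googleapis.com/envoy.config.listener.v3.Listener":
		kind = "lds"
	case "type.googleapis.com/envoy.config.cluster.v3.Cluster":
		kind = "cds"
	case "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment":
		kind = "eds"
	case "type.googleapis.com/envoy.config.route.v3.RouteConfiguration":
		kind = "rds_" + resourceName // route configs are also keyed by resource name
	default:
		kind = "unknown"
	}
	return nodeCluster + "_" + kind
}

func main() {
	fmt.Println(aggregatedKey("staging", "type.googleapis.com/envoy.config.cluster.v3.Cluster", ""))
	// staging_cds
	fmt.Println(aggregatedKey("staging", "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment", ""))
	// staging_eds
}
```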
So we're excited to announce that in the next few weeks we're going to be releasing version one of xds-relay, and this covers the MVP mechanisms that we showed in the demo and earlier. Beyond that, we're looking to create extensions, including a few here that we're pretty excited about.

One is the state-of-the-world to delta xDS transformation. Very few control planes in the wild have implemented support for incremental xDS, despite the big performance gains. So rather than having all control plane servers make this migration, xds-relay can implicitly make the conversion and cache the response deltas. Another is API-driven configuration: rather than having xds-relay maintain connections to the upstream server, we want to make it possible for operators to directly write to the xds-relay cache in a push model when there's updated configuration information. Another one that we're particularly excited about is endpoint subsetting. In a topology like the one we mentioned earlier, it might not be ideal for xDS clients to be aware of endpoints running in a different VPC or data center. Similar to the aggregation rules, we're looking at creating a rule-based configuration where operators can use xds-relay to send back a subset of the EDS information rather than all endpoints that exist in the control plane response. Another interesting use case is blue-green control plane deploys: being able to use xds-relay to determine the percentage of traffic that should roll out to a new control plane server. And the list goes on; in the interest of time, we won't be able to cover them all today. But as you can see, there are plenty of directions that we could and want to take this project.

We're always looking for contributors, and we love talking to people who are curious about control plane implementations and their use cases. Please reach out to one of us directly or visit the Envoy Slack if you'd like to talk with us more. Lyft is also looking for engineers with Envoy experience, so if that's something you're interested in, please reach out to me or Jyoti directly. Thanks.

Awesome. Hi, everyone. I think we're both live. I can answer this question: in the context of endpoint subsetting, where do you draw the line between logic that belongs in xds-relay and logic that should be in the control plane? Personally, I think this is going to vary with the use cases your company is implementing. The part we want to implement in xds-relay is anything that can be defined based on rules; that makes sense to live in xds-relay. But if there's too much custom logic that has to be generated from service metadata based on your company's own logic, then it should fall to the control plane. The plugin model also allows us to define custom ways of splitting or subsetting the endpoints, so that should work too.

Jyoti, maybe you can take this next one about the upstream. Yeah, the question is: if the control plane is momentarily not present, will Envoy continue to receive configuration? With xds-relay, this depends on the upstream, but if the upstream is not present, we have retry and backoff mechanisms in place so that we keep trying until the control plane comes back, and xds-relay will keep serving configuration from the cache to the sidecars. So the fleet will continue operating and there will be no disruption until the backend control plane comes back. I might have missed it, but how widely has this been used in the wild?
Because we haven't issued an initial release yet, I would say there are no customers using it in the wild. We have a few people from other companies testing it for us, and we're releasing it internally within Lyft, but so far it's in an early alpha stage. We are serving our staging traffic on it at this point, and once we're in production we'll probably do an MVP release.

When the backend control plane comes back, do we invalidate the cache? So yes, when the backend control plane comes back, there will be a new stream from xds-relay to the backend control plane as if everything were fresh, so whatever the updated version is will be reflected in the cache.

I guess we're almost at time, unless there are any other questions. How many pods do you have in your clusters? It varies. We've had times when we were at around 50K pods, but I have no exact numbers to share at this point. Sorry for my ignorance here, but can you expand that acronym? Yeah, we can probably take this offline; you could engage with us on the xds-relay Slack or on GitHub. I'm afraid you'll be cut off at 1:30, so it was great sharing our information with you today. Thank you. Yeah, thanks everyone.