Hi everyone. Welcome to our talk on how to validate Envoy configs at scale. We'll be going over strategies for automating the config validation and testing process so that your service owners can iterate quickly and independently, and so that your Envoy operators aren't bogged down with code reviews. My name is Lisa and I'm speaking here today with Jyoti. Both of us have worked on the router check tool, which is one of the main tools we'll discuss here, as well as internal tooling at Lyft for making the testing and validation of Envoy configs better. We're really excited to share our learnings with you all today.

To start off, I'll describe Lyft's infrastructure and how we generate and transport our many Envoy configs. Our bootstrap Envoy config and our route and cluster definitions are stored in a repository as YAML files, which we template with Jinja. Service teams can add their routes to this Jinja- and Python-based infrastructure, which is eventually wired up with the control plane to deliver updates to the front proxy. Upon container startup, the Envoy binary and bootstrap config get pulled, and the static bootstrap config is rendered with Jinja. Sidecars in the internal mesh, the front proxy, and the egress proxy are all started with this. To avoid re-deploying Envoy every time a new route gets added, we use RDS, the Route Discovery Service, to serve routes from a control plane built on go-control-plane. The bootstrap config is configured to request xDS from our control plane. Meanwhile, the routes and clusters, which are stored in S3, are fetched by the control plane, which creates the respective route configuration and cluster objects to send via xDS to the sidecars.

As such, there are two main points at which we have to validate the format and content of the configs. The first such place is the bootstrap config, which ensures that the sidecar is able to run properly and start up without crashing.
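As a rough illustration of that wiring, here is a minimal, hand-written sketch of a bootstrap config whose listener pulls its routes over RDS from a control plane. This is not Lyft's actual config; the names (`xds_cluster`, `local_routes`) and addresses are placeholders, and details such as the HTTP/2 settings for the gRPC connection are elided.

```yaml
# Sketch: an HTTP connection manager that fetches its route table via RDS
# from a control plane cluster. All names and addresses are placeholders.
static_resources:
  listeners:
  - name: ingress
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          rds:
            route_config_name: local_routes    # served by the control plane
            config_source:
              resource_api_version: V3
              api_config_source:
                api_type: GRPC
                transport_api_version: V3
                grpc_services:
                - envoy_grpc: { cluster_name: xds_cluster }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: xds_cluster    # points at the go-control-plane server
    type: STRICT_DNS     # (HTTP/2 options for the gRPC connection elided)
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: control-plane.internal, port_value: 18000 }
```

With this shape, route changes flow through the control plane rather than requiring a restart of the sidecar.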
The other place we need to validate is the route and cluster definitions. We validate these before sending them to the control plane, because we want to avoid the control plane having to do any validation itself, since it gets the route and cluster information in real time and rather frequently. As the number of services and routes at Lyft has grown, on some sidecars we have configurations that are upwards of 100,000 lines of YAML, which makes maintenance and modification extremely complex and risky. I'll describe some of the issues that we've encountered over the last year.

Firstly, as a growing company we have new developers joining every week, and business expansion also means that new services are being spun up quite frequently to cater to new scenarios. This means more code review requests from developers who aren't familiar with Envoy and want to know whether their change will do what they want it to do. While it's great that we're seeing more and more development, it's not so great that the on-call burden has increased: the on-call spends more time triaging these tickets and staring at YAML. Another issue is human error in configs. With such large YAML files, it isn't feasible to catch everything by eye. For example, one common source of bugs is that Envoy uses first-match-wins ordering for its routes, which means that in a huge list of routes it can be easy for a developer or reviewer to miss an earlier route that captures all the traffic actually intended for a new route. These mistakes can lead to outages in your services. In response to such errors, we actually mandated a code review from the networking team for all route changes at one point. This quickly became the bottleneck for iteration: our networking team is about six people, and while we were the guardians of these route changes for a while and tried to eyeball every change going in, this just wasn't sustainable and was slowing iteration and rollouts for the rest of the company.
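To make the first-match-wins hazard concrete, here is a hypothetical route list (the cluster and path names are invented for illustration) where a broad catch-all shadows a later, more specific route:

```yaml
# First match wins: Envoy scans this list top-down and stops at the
# first matching route, so ordering is load-bearing.
virtual_hosts:
- name: api
  domains: ["*"]
  routes:
  - match: { prefix: "/" }            # broad catch-all added earlier
    route: { cluster: legacy-monolith }
  - match: { prefix: "/rides/v2" }    # never reached: "/" already matched
    route: { cluster: rides-v2 }
```

Swapping the two entries, most specific first, fixes it; a mistake like this is easy to miss in review when the catch-all and the new route are thousands of lines apart.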
Next, we saw tech debt piling up. Dead routes and dead clusters refer to unused routes and unused clusters: for example, clusters that are no longer necessary or can't be routed to, or routes that point to services that no longer exist. This is quite common with deprecation; service owners might delete the cluster config and forget to delete the routes, or vice versa, leaving clutter and causing the control plane to keep creating route and cluster definitions that are no longer needed.

The next issue is that changes are hard to test. This was actually a main reason for the first issue I highlighted: service owners weren't sure how to validate their changes, which is why they wanted to reach out to the networking team in the first place. Prior to automated testing, it was really difficult to validate changes; you had to have an end-to-end setup, spinning up an Envoy sidecar and sending requests to see if your config changes worked.

Finally, Lyft recently switched to pulling in the latest Envoy version on a weekly basis. While this is great for minimizing tech debt, since we stay up to date quite frequently, it runs the risk of constantly dealing with incompatible config changes and internal instability. When we say config here, we mean both the Envoy bootstrap config and the xDS configuration; an error in either of these could cause an Envoy sidecar to reject the config and fail to start up, or fail to apply updated xDS objects. In order to maintain this high speed of iteration, we knew we would have to automate our testing to make sure our Envoys wouldn't regress whenever we pulled in the latest version. With all these issues in mind, we set out to invest in tooling and tests that would make config deployments safe and self-service. The first strategy I'll talk about here is how we addressed tech debt, in particular the dead clusters and routes.
We wrote some in-house scripts that parse the route and cluster configs and remove the ones that are dead. In this scenario, ideally (the top row) you have routes pointing to your service's cluster. But when it comes to new service launches or service deprecations, sometimes developers only address the route definitions and forget the clusters, or vice versa. For example, in the second row you see that there's no route definition pointing to the service. This could mean the developer didn't add any routes, which means the cluster definition isn't being used, or that they deprecated the routes and forgot to remove the cluster. While this cluster is not accessible by outside users, your control plane is still creating unnecessary resources and sending them to sidecars. In addition, it just creates tech debt and can make reading through configs confusing, since the service is presumably no longer (or not currently) being used. In the third scenario, you have accessible routes pointing to a cluster that isn't configured or doesn't exist. Here Envoy will return a 503 to requests going to these routes, because it doesn't see any healthy members for the cluster, and obviously you don't want your users hitting 503s for no reason. So both checks can prevent human error from derailing a service launch or deprecation, which usually happens when someone forgets one of the configs.

Next, to validate the bootstrap config, we make use of a couple of open-source tools. The first one I'll discuss is the validation server. This refers to running the Envoy binary in validation mode, which you can set via the `--mode validate` command-line flag. What this does is take your binary and your bootstrap config and try to boot up Envoy without starting workers; it goes through the server initialization process as far as it can, and if there are no errors it exits successfully.
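The dead-cluster and dangling-route checks described above boil down to a set difference over cluster names. Here is a hypothetical sketch of that logic; the function name and data shapes are invented for illustration, and Lyft's real scripts work over the Jinja-templated YAML rather than plain dicts.

```python
def find_dead_config(routes, clusters):
    """Cross-reference route targets against defined clusters.

    routes:   list of {"prefix": ..., "cluster": ...} dicts
    clusters: list of defined cluster names
    Returns (dead_clusters, dangling_routes):
      dead_clusters   -- defined clusters that no route points to
      dangling_routes -- routes whose target cluster is not defined
    """
    referenced = {r["cluster"] for r in routes}
    defined = set(clusters)
    # Clusters nobody routes to: candidates for deletion (tech debt).
    dead_clusters = sorted(defined - referenced)
    # Routes to missing clusters: these would serve 503s to users.
    dangling_routes = [r for r in routes if r["cluster"] not in defined]
    return dead_clusters, dangling_routes

# Illustrative data, not real Lyft services:
routes = [
    {"prefix": "/rides", "cluster": "rides"},
    {"prefix": "/legacy", "cluster": "legacy-svc"},  # cluster was deleted
]
clusters = ["rides", "payments"]  # no route points at "payments"

dead, dangling = find_dead_config(routes, clusters)
print(dead)      # ['payments']
print(dangling)  # [{'prefix': '/legacy', 'cluster': 'legacy-svc'}]
```

Running a check like this in CI turns a silent deprecation mistake into a failed pull request instead of clutter or a user-facing 503.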
The next tool we use is the config load check tool; you can see the docs on the left. This is a standalone binary which you can run on all of your bootstrap configs, and it basically checks that all the values in the fields are valid as defined by the proto, and that the schema is a valid Envoy schema. We run both of these checks on all of our pull requests in CI, and that makes reviewing these changes so much easier: as the reviewer, you already know that the config is a valid Envoy config, so you only need to ensure that the change does what it should. I'll now hand it over to Jyoti to discuss the router check tool and its various functionalities.

Thank you, Lisa. Next I'm going to talk about the router check tool. The part of our configuration in the most flux over the last few years has been modifications to the routes. Routes work in a very order-sensitive way: the routing engine runs an incoming request over a set of routing rules, and the first one to match wins. This becomes risky in a high-flux change scenario; a route mistakenly added at the top of the list can black-hole all traffic and cause incidents. Envoy has a router check tool executable. It exercises the routing engine and runs it over the routing configurations you give it. It lets us add unit tests, check field deprecations, add code coverage constraints, and test complex routing configurations based on header matches, runtime values, and weighted cluster configs. We'll look at how to write the tests. Adding tests for code is a well-known pattern: there's a subject under test defined by a test name, some setup, and assertions. Similarly, a routing test has its name, a set of input conditions, and a set of assertions. The test runs the input through the routing engine and compares the assertions with the actual results. Let's follow an example. Imagine we have a routing table as shown in the picture on the left. Imagine thousands of such routes.
One can imagine the plight of the developer who is making changes to such a long config. We need a way to prevent regressions and have the developer make a change and find out about any mistake right in the PR. On the right there are a few examples of what a test looks like. It starts with the test name. There's an input section which works as the test setup; it has the URL, the method, headers, etc. The tool has a strict set of assertions: it runs the config through the routing engine and matches the resulting cluster name, path or host rewrites, and redirects. This was great, and we took all our routes and added tests for all of them. We used our telemetry and automated the test generation too. This means an existing route can never silently regress anymore; if we make a bad change, we can detect it right in the PR.

Let's run the tests and see what it looks like. The tool takes a routing configuration and a test file. It signals success or failure using the exit code: exit code 0 is success, and anything else is a failure. The tests aren't verbose by default; you can add verbosity with the details flag on the command line, and check the result using `$?`, which shows the last exit code. When we have hundreds of routes, that verbosity added more friction than we wanted. The workflow was new, and developers did not understand the testing semantics. We added another flag, only-show-failures, so the PR failure logs would contain exactly the tests that failed, making failures easier to understand: it prints only the failing test names and all the assertions that failed. All this was exciting, but the tool did not have a way of enforcing that tests get added, and developers were able to work around our stopgap attempts at test enforcement. We decided we needed a way to add code coverage to the tool. There are two ways of adding coverage. The first is shallow, and just checks that there is a test for every routing rule. The second is more strict, and enforces writing all assertions for each test.
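Putting these pieces together, a test file entry with a full set of assertions can look roughly like the following sketch. The route names, paths, and assertion values are invented, and the exact flag spellings should be checked against the route table check tool docs for your Envoy version.

```yaml
# tests.yaml -- sketch of a route table check tool test file.
# Invocation sketch (verify flag names against your Envoy version's docs):
#   router_check_tool --config-path routes.yaml --test-path tests.yaml \
#       --details --only-show-failures --fail-under 100 --covall
tests:
- test_name: rides_api_routes_to_rides_cluster    # hypothetical route
  input:
    authority: api.example.com
    path: /rides/v2/123
    method: GET
  validate:
    cluster_name: rides-v2       # first-matching route's target cluster
    virtual_host_name: api
- test_name: old_path_redirects
  input:
    authority: api.example.com
    path: /legacy
  validate:
    path_redirect: /v2/legacy    # hypothetical redirect assertion
```

Here the coverage threshold fails the run if too few routes (or, with comprehensive coverage, too few assertion fields) are exercised by the test file.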
This gives us the comfort that all assertions have been tested by the developer and that the routes do what they intend. In the example, the third box shows a failing test which is complaining about low code coverage and tells us which particular assertions are failing. Like any production system, our infra relies on runtime configs for safe rollouts, so developers needed a way to flip a runtime flag in the config and test how the routes behave. We added new fields in the input to set the runtime and random values and test the output. We now had almost 100% code coverage in our system. Routes can also behave differently based on headers, if you wish them to, and we needed a way to enforce testing of those routes too: we can supply headers in the test setup and test the behavior of the routes.

Constantly updating the Envoy binary comes at a cost. There are fields getting deprecated all the time, and having them in the system introduces tech debt and a higher migration cost later. We added a deprecated-field check to the tool so that it fails whenever it observes a deprecated field. We resolve these quickly, since usually it's one or two fields at a time, and they don't stay around long enough to become risky. Adding tests for untested parts of the system is a culture shift. It needs help from developers to test things, and also a few curious ones to work around the rules so that we can put in better enforcement. We had a bumpy ride while putting this in our system, but in the end it was a win-win for everyone: the networking team saved the time on-calls spent eyeballing routing configurations, and we helped developers ship stuff faster. I'll hand it over to Lisa to talk about future directions.

Great, thanks, Jyoti. So hopefully through this presentation you've seen the different ways you can utilize open-source tooling, or write your own scripts, to automate the testing and validation of Envoy configs.
And while this is really powerful (it's certainly helped us keep up with our huge Envoy configs), there's still a lot of room for improvement in how we build the tooling to test these. First off, Envoy has a ton of different features, but there still isn't test support for all of them. At Lyft, our contributions have mostly been spurred by demand from our developers. For example, the router check tool didn't always have the capability to calculate coverage, or have test support for runtime values and flags, or for header manipulation and checking header values. And while those are supported now, there's still a bunch of features that users want, such as CORS, or checking that your direct response route returns the expected status code. So there's definitely plenty of room to increase the kinds of behavior that can be tested with the router check tool.

Building off of this, one major improvement would be utilizing production code. Currently the router check tool uses the same routing function as Envoy in production itself, but when it comes to things like header validation, it's more or less a copy of the code that runs in production. This is also one of the main roadblocks to implementing something like CORS testing support, because ideally you want to be able to use the same code as production so that testing support is consistent and doesn't diverge. It would also make adding test support a lot less brittle, because, as you may know, open-source Envoy gets a lot of changes every day, so just copying the code over from production is not very feasible. And then finally, I think the ideal state of the tooling would be true black-box testing. What this would mean is inputting just a full Envoy config, rather than having to know specific route inputs for your unit tests beforehand.
This way, users could simulate request behavior without having to come up with the specific test cases they want, and just inspect the resulting response. This could be used hand in hand with the existing unit testing flow. I think this particular approach also lends itself well to a UI-based way of testing the routing table, in which you can just input some parameters and see what the resulting behavior is. That might be more intuitive for service owners than having to fill out the values for very specific Envoy route configuration fields. So yeah, definitely a lot of room for improvement in the router check tool, and we'd love to chat if you are interested in contributing. With that, that concludes our talk. Please let us know if you have any questions. Thank you.

Hi. There was a question around YAML. So yes, we have a lot of YAML files, which are templatized using Jinja. They are put together as part of our PR outputs, or artifacts, and they land on the sidecars. Yeah, a lot of YAML. And on the question about it being a standalone tool: the router check tool lives in a separate test-tools folder, and you can use a separate build target to build the executable, run it in your PR flows, and just give it the config and test files; the tests run without a full Envoy. How long does deployment take? So, we do per-AZ deployments, and the sidecar YAMLs and templates are already predefined for production, staging, and development, so that doesn't take up time. But yeah, we roll out routes in the span of about five to ten minutes per AZ. Yeah, I think around ten minutes per AZ sounds right. Do you know why we went with Jinja, or the historical context for that? It's probably an artifact of how it was written a few years back, and we haven't moved away from it yet. So, our control plane will fetch the Envoy routes.
There's a file that we write the routes to in S3, and the control plane will sync that at a given time interval, then create the route configuration objects from that information and send them down to the sidecars. We also use a diffing mechanism based on a manifest version: only if there's a diff in the S3 files does it get applied, otherwise it's a no-op. Yeah, it's on an interval.

About not all sidecars using the latest version: we had trouble there; we had to chase services which were using an older version. Right now there are mechanisms in Envoy we could use to serve both v2 and v3 fields at runtime. That landed last month, I guess, but it was not available two or three months back. So we ended up migrating all of our sidecars to use the latest Envoy, and then it worked out. But we make sure, using the deprecated-field check, that there are no deprecated fields in our configs, and PRs don't pass if there are.

Why are the Envoy files so large? Lisa, do you want to take that? Yeah, so there are some services at Lyft that need to communicate with most of the other services, so as a result there are lots of clusters and lots of route information for them. And as the number of services has grown, we just have a lot of user-defined routes and a lot of default config content that gets added to every service's config. So over time it's just grown a lot in size. And I guess to be explicit, the deprecation check is a flag that you can pass in to the router check tool. Yeah, we struggled with v2 and v3 quite a bit, but we're about over the hump, I guess.

The next question is about how the configs are stored in GitHub. We have a separate repo just for storing the routes, so that people can add and change routes as they wish. We have different mechanisms: we have JSON-based configs that we parse and create xDS resources out of.
We also have configs which replicate the routes exactly; we just deserialize them and pass them on to the sidecars. So we go both ways there. Sometimes, in order to reduce the number of knobs that we expose to developers, we only allow a particular JSON format, so that they can set the specific fields they want to override and everything else is defaulted. But with routes that's probably not very useful, because there would probably be tens of knobs, so we leave routes as raw configuration in the config. Yeah, the front proxy, yes; that covers a lot of our edge traffic, so a lot of it is handled there. We have two ways of doing this. One is through the service manifest: people can write configs in their own service's manifest, and the control plane has a way of getting those from S3. And when they deploy to development, they can change only their service's configs, and the control plane in development can pick them up. So now they don't have to check in configs specifically; they can make a branch and work on top of it in development. I guess we're almost at time. It was great talking about this, and thank you all for listening. Feel free to reach out to us.