Hi everyone, welcome to our talk on how to validate Envoy configs at scale. We'll be going over strategies for automating the config validation and testing process so that your service owners can iterate quickly and independently. This will also prevent your Envoy operators from being bogged down with code reviews. My name is Lisa, and I'm speaking here today with Jyoti. Both of us have worked on the router check tool, which is one of the main tools we'll discuss here, as well as internal tooling for making the testing and validation of Envoy configs better while at Lyft. We're really excited to share our learnings with you all today.

To start off, I'll describe Lyft's infrastructure and how we generate and transport our many Envoy configs. Our bootstrap Envoy config and our route and cluster definitions are stored in a repository as YAML files, which we template with Jinja. Service teams can add their routes to this Jinja- and Python-based infrastructure, which eventually is wired up with the control plane to deliver updates to the front proxy. Upon container startup, the Envoy binary and bootstrap config get pulled, and the static bootstrap config is generated with Jinja. Sidecars in the internal mesh, the front proxy, and the egress proxy are all started with this. To avoid rolling Envoy every time a new route gets added, we use RDS, the route discovery service, to serve routes from a control plane built on go-control-plane. The bootstrap config is configured to request xDS from our control plane. Meanwhile, the routes and clusters, which are stored in S3, are fetched by the control plane, which creates the respective route configuration and cluster objects to send via xDS to the sidecars.

As such, there are two main points at which we have to validate the format and content of the configs. The first such place is the bootstrap config, which ensures that the sidecar is able to run properly and start up without crashing. The other place we need to validate is the route and cluster definitions. We validate these before sending them to the control plane, because we want to avoid the control plane needing to do any validation, as it gets the route and cluster information in real time and rather frequently.

As the number of services and routes at Lyft has grown, on some sidecars we have configurations that are upwards of 100,000 lines of YAML, which makes maintenance and modification extremely complex and risky. I'll describe some of the issues we've encountered over the last year. Firstly, as a growing company we have new developers joining every week, and business expansion also means that new services are being spun up quite frequently to cater to new scenarios. This means more code review requests from developers who aren't familiar with Envoy and want to know if their change will do what they want it to do. While it's great that we're seeing more and more development, it's not so great that the on-call burden has increased: the on-call spends more time triaging these tickets and having to stare at YAML.

Another issue is human error in configs. With YAML files this large, it isn't feasible for reviewers to catch everything. For example, one common source of bugs is that Envoy matches routes in order: the first route that matches wins. In a huge list of routes, it's easy for a developer or a reviewer to miss that an existing route captures all the traffic actually intended for a new route. Mistakes like these can lead to outages in your services.
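To make that pitfall concrete, here is a minimal, hypothetical sketch of a route configuration; the virtual host, domain, and cluster names are all invented, but the shape follows Envoy's RouteConfiguration:

```yaml
# Hypothetical RDS snippet illustrating first-match-wins ordering.
virtual_hosts:
- name: internal_api              # invented name
  domains: ["api.example.com"]    # invented domain
  routes:
  # This catch-all prefix match sits first in the list, so it wins for
  # every request, including traffic meant for the new route below.
  - match: { prefix: "/" }
    route: { cluster: legacy-monolith }
  # Never reached: shadowed by the catch-all above.
  - match: { prefix: "/rides" }
    route: { cluster: rides }
```

In a file with thousands of routes, a shadowed entry like the second one is exactly the kind of mistake that is easy to miss in review.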
In response to such errors, we actually mandated a code review from the networking team for all route changes at one point. This quickly became the bottleneck for iteration, because our networking team is about six people. And while we were the guardians of these route changes for a while and tried to eyeball every change going in, this just wasn't proving to be sustainable and was slowing iteration and rollouts for the rest of the company.

Next, we had the issue of tech debt piling up. Dead routes and dead clusters refer to unused routes and unused clusters: for example, clusters that are no longer necessary or can't be routed to, or routes that point to services that no longer exist. This is quite common with deprecations. Service owners might delete the cluster config and forget to delete the routes, or vice versa, leaving clutter and causing the control plane to continue creating route and cluster definitions that are no longer needed.

The next issue is that changes are hard to test. This was a main cause of the first issue I highlighted: service owners weren't sure how to validate their changes, which is why they wanted to reach out to the networking team in the first place. Prior to automating our testing, it was really difficult to validate changes. You had to have an end-to-end setup, spinning up an Envoy sidecar and sending requests to see if your config changes worked.

Finally, Lyft recently switched to pulling in the latest Envoy version on a weekly basis. While this is great for minimizing tech debt, since we stay up to date quite frequently, it runs the risk of always dealing with incompatible config changes and instability. When we say config here, we mean both the Envoy bootstrap config and the xDS configuration; an error in either of these could cause an Envoy sidecar to reject the config and fail to start up, or fail to apply updated xDS objects. In order to maintain this high speed of iteration, we knew we would have to automate our testing to make sure that our Envoys wouldn't regress whenever we pulled in the latest version.

With all these issues in mind, we set out to invest in tooling and tests that would make config deployments safe and self-service. The first strategy I'll talk about here is how we addressed tech debt, in particular the dead clusters and routes. We wrote some in-house scripts that parsed the route and cluster configs and removed the ones that were dead. So in this scenario, ideally, as in the top row, you would have routes pointing to your service's cluster. But when it comes to new service launches or service deprecations, sometimes developers only address the route definitions and forget the clusters, or vice versa. In the second row, for example, you see that there's no route definition pointing to the service. This could mean the developer never added any routes, so the cluster definition isn't being used, or that they deprecated the routes and forgot to remove the cluster. While this cluster is not accessible by outside users, your control plane is still creating unnecessary resources here and sending them to sidecars. In addition, it just creates tech debt and can make reading through configs confusing, since the service presumably is no longer, or isn't currently, being used. In the third scenario, you have accessible routes pointing to a cluster that isn't configured or doesn't exist. Both failure modes are sketched below.
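Here is a simplified, hypothetical sketch of both failure modes, with the route and cluster definitions collapsed into one snippet for illustration and all names invented:

```yaml
routes:
# Dangling route: references a cluster that is never defined below.
- match: { prefix: "/payments" }
  route: { cluster: payments }

clusters:
# Dead cluster: defined, but no route above points at it.
- name: legacy-onboarding
  connect_timeout: 0.25s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
```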
When a route points at a missing cluster like that, Envoy will return a 503 for requests going to those routes, because it doesn't see any healthy members for the cluster. Obviously, you don't want your users to be hitting 503s for no reason. So both checks here can prevent human error from derailing a service launch or deprecation, which usually happens when someone forgets one of the configs.

Next, to validate the bootstrap config, we make use of a couple of open source tools. The first one I'll discuss is the validation server. This refers to running the Envoy binary in validation mode, which you can set via the --mode validate command-line flag. What this does is take your binary and your bootstrap config and try to boot up Envoy without starting workers; it goes through the server initialization process as far as it can, and if the config is valid, it exits successfully.

The next tool we use is the config load check tool, and you can see the docs on the left. This is a standalone binary which you can run on all of your bootstrap configs, and it basically checks that all the values in the fields are valid as defined by the proto, and that the schema is a valid Envoy schema. We run both of these checks on all of our pull requests in CI, and that makes reviewing these changes so much easier, because as the reviewer you already know that the config is a valid Envoy config, and so you only need to ensure that the change does what it should.

I'll now hand it over to Jyoti to discuss the router check tool and its various functionalities.

Thank you, Lisa. Next, I'm going to talk about the router check tool. The part of the configuration in most flux over the last few years has been modifications to the routes, and routes work in a very sensitive way. The routing engine runs an incoming request over a set of routing rules, and the first one to match wins. This makes the routing engine risky in a high-flux change scenario: a route mistakenly added at the top of the list can black-hole all traffic and cause incidents. Envoy has a router check tool executable, which exercises the routing engine and runs it over the routing configurations you give it. With it we can add unit tests, check field deprecations, add code coverage constraints, and test complex routing configurations based on header matching, runtime values, and weighted cluster configs.

Let's look at how to write the tests. Adding tests for code is a well-known pattern: there is a subject under test identified by a test name, some setup, and assertions. Similarly, a routing test has a name, a set of input conditions, and a set of assertions. The test runs the input request through the routing engine and compares the assertions with the actual results.

Let's follow an example. Imagine we have a routing table as shown in the picture on the left, and imagine thousands of such routes. One can imagine the plight of the developer who is making changes to such a long config. We need a way to prevent regressions and let the developer make a change and find out about any mistakes right in the PR. On the right there are a few examples of what a test looks like. It starts with the test name. There is an input section, which works as the test setup; it has a URL, the method, headers, etc. The tool supports a strict set of assertions: it runs the config through the routing engine and matches the resulting cluster name, path or host rewrites, and redirects. A sketch of such a test follows below.

This was great, and we took all our routes and added tests for all of them. We used our telemetry and automated the test generation too. This means an existing route can never regress anymore.
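For reference, a minimal test file for the router check tool might look like this. The schema follows the upstream tool's documentation for recent Envoy versions; the test name, domain, and cluster here are invented:

```yaml
tests:
- test_name: rides_path_maps_to_rides_cluster
  input:
    authority: api.example.com   # the :authority of the simulated request
    path: /rides/123
    method: GET
  validate:
    cluster_name: rides          # assert which cluster the engine selects
```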
If we make a bad change, we can detect it right in the PR. Let's run the tests and see what that looks like. The tool takes a routing configuration and a test file, and it signals success or failure using its exit code: exit code 0 is success and anything else is a failure. The tests aren't verbose by default; you can add verbosity by adding the details flag on the command line, and check the result with echo $?, which prints the last exit code.

When we have hundreds of routes, verbose output caused more friction than we wanted. The shift was new, and developers did not yet understand the testing semantics. So we added another flag, only-show-failures, and with it the PR failure logs contain exactly the tests that failed, making failures easier to understand: it prints the test names and all the assertions that failed.

All this was exciting, but the tool did not have a way of enforcing that tests get added, and developers were able to work around our stopgap ways of test enforcement. We decided we needed a way to add code coverage to the tool. There are two ways of measuring coverage. The first one is shallow and just checks that there is a test for every routing rule. The second one is more strict and enforces writing all assertions for each test. This gives us the comfort that all assertions have been exercised by the developer and that they know what they intend to do. In the example, the third box shows a failing run, which is complaining about low code coverage and tells us which particular assertions are missing.

Like any production system, our infra is based on runtime configs for safe rollouts, so developers needed a way to flip runtime values in the config and test how the routes behaved. We added new fields in the test input to exercise runtime-guarded routes using a chosen random value and test the output. We now had almost 100% code coverage in our system. Routes can also behave differently based on headers, if you wish them to, and we needed a way to enforce testing on those routes too: we can supply headers in the test setup and test the behavior of the routes.

Constantly updating the Envoy binary comes at a cost. There are fields getting deprecated all the time, and having them in the system introduces tech debt and a higher migration cost later. We added a deprecated-field check to the tool so that it fails whenever it observes a deprecated field. We resolve it quickly, since usually it's one or two fields at a time, and it doesn't stay long enough to become risky.

Adding tests for untested parts of the system is a culture shift. It needs help from developers to test things, and also a few curious ones to work around the enforcement, so that we can put in better enforcement. We had a bumpy ride while putting this into our system, but in the end it was a win-win for everyone: it saved the time on-calls spent eyeballing routing configurations and helped developers ship stuff faster. I'll hand it over to Lisa to talk about the future directions.

Great, thanks Jyoti. Hopefully through this presentation you've seen the different ways you can utilize open source tooling, or write your own scripts, to automate the testing and validation of Envoy configs. And while this is really powerful, and it's certainly helped us keep up with our huge YAML configs, there's still a lot of room for improvement in how we build the tooling to test these. First off, Envoy has a ton of different features, but there's still not test support for all of them. At least at Lyft, our contributions have mostly been spurred by demand from our developers. For example, the router check tool didn't always have the capability to calculate coverage, or test support for runtime values and flags, or header manipulation and checking header values. Those are supported now; a sketch using them follows below.
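Putting those features together, a test exercising a runtime-guarded, header-matched route might look roughly like this. The field names follow the upstream tool's docs for recent Envoy versions; the route, runtime key, header, and values are invented:

```yaml
tests:
- test_name: canary_header_routes_to_canary
  input:
    authority: api.example.com
    path: /rides/123
    # Simulated request headers, for header-based route matches.
    additional_request_headers:
    - key: x-canary
      value: "true"
    # Exercise a runtime-guarded route: the engine compares this
    # value against the route's configured runtime fraction.
    runtime: routing.canary.enabled
    random_value: 25
  validate:
    cluster_name: rides-canary
```

And a CI invocation checking the exit code and enforcing coverage might look like this (a sketch; flag names per the upstream docs, file paths invented):

```sh
router_check_tool \
  --config-path routes.yaml \
  --test-path route_tests.yaml \
  --only-show-failures \
  --fail-under 100   # fail the build below 100% route coverage
echo $?              # 0 on success, non-zero on any failure
```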
But there's still a bunch of features that users want, such as CORS, or checking that your direct response route returns the expected status code. So there's definitely plenty of room to increase the kinds of behavior that can be tested with the router check tool.

Building off of this, one major improvement would be utilizing production code. Currently the router check tool uses the same routing function as Envoy in production itself, but when it comes to things like header validation, it's more or less a copy of the code that runs in production. This is also one of the main roadblocks to implementing something like CORS testing support, because ideally you want to be able to use the same code as production, so that test support is consistent and doesn't diverge. It would also make adding test support a lot less painful because, as you may know, open source Envoy gets a lot of changes every day, and so just copying the code over from production is not very feasible.

And then finally, I think the ideal state of the tooling would be true black-box testing. What this would mean is only having to input a full Envoy config, versus knowing specific route inputs for your unit tests beforehand. This way, users could simulate request behavior without having to come up with the particular test cases they want, and just inspect the resulting response. This could be used hand in hand with the existing unit testing flow, but I think this approach also lends itself well to a UI-based way of testing the routing table, in which you can just input some parameters and see what the resulting behavior is. That might be more intuitive for service owners than having to fill out the values for very specific Envoy route configuration fields.

So yeah, there's definitely a lot of room for improvement in the router check tool, and we'd love to chat if you're interested in contributing. With that, that concludes our talk. Please let us know if you have any questions. Thank you.