Well, hello everybody. Good afternoon. My name is Adi and I'm here with Teju, and today we're going to talk about fuzz testing of Envoy. But before we go into the details of how we use fuzzing to make Envoy more robust and secure, let's give a bit of background about fuzzing.

For a typical software component, we as developers add functional tests that verify that our code is correct. We usually add all kinds of tests that validate that some input one has the expected output one, and we add more than just input one, right? We add input two and input three and so on. However, our functional tests sometimes miss some corner cases, specifically those that check whether there is some bad input that can cause our program to crash. This is what fuzz testing addresses. Fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program. The fuzzer framework takes the code and instruments it to track which code blocks are exercised during the fuzzing process. The fuzzer reads inputs from a corpus directory. These inputs are used as seeds for an input generator. The generator then creates an input, and the fuzzing infrastructure executes the code with that input. If an error occurs during the execution, an issue is added to a database. Otherwise, the input generator takes the previous input and the code coverage report and uses them to propose the next mutated input.

There are three main categories of errors that the fuzzers attempt to detect. The first is memory safety errors, such as using an object after it is freed, or other types of data races that may occur. The second is verifying that resource usage is bounded, for example detecting infinite loops, deadlocks, or stack overflows. The third is plain old crashes, such as null pointer dereferences, failed assertions, or any other segmentation faults. Some of our fuzzers were very useful in detecting severe bugs in the code that ended up being CVEs. They typically find bugs that are hard to detect just by reviewing the code. Our fuzzers have also detected non-CVE bugs; we've addressed some of these, and we're continuing to monitor the fuzz issues and address them as we can. Next I'll hand it over to Teju to talk about writing a fuzz test.

So with that background, let's do a concrete walkthrough of writing a fuzz test and onboarding it. We'll use the access logger library as the library under test. This is a great fuzz target because its functionality is split over two components: the config plane and the data plane. In the config plane, the library parses the input access log format string into a list of internal formatters. Then in the data plane, each formatter runs on every single request and response. Each formatter extracts information from the headers, trailers, and stream info to create the final string. Our goal is to write a fuzz test for both functionalities. There are three main steps we follow to create a fuzzer and onboard it into Envoy proxy. First we define the fuzz input schema. Next we write the C++ fuzz test, and finally we add in the initial corpus. Let's walk through these one by one.

So first, the fuzz input schema. The input schema essentially indicates to the fuzzing engine the types of data to generate. Our goal is to make an input schema that captures the true input space of the library under test. In Envoy, we use protocol buffers to write input schemas. That's something specific to our repo.
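To ground the walkthrough that follows, here is a rough sketch of what such a proto input schema could look like for the access logger. The package, import path, and field names are illustrative approximations of what lives in the Envoy repo, not exact definitions.

    // Illustrative fuzz input schema for the access logger (names approximate).
    syntax = "proto3";

    package test.common.access_log;

    // Shim messages (Headers, StreamInfo) that let the fuzzing engine generate
    // request/response data; the actual import path in the repo may differ.
    import "test/fuzz/common.proto";

    message TestCase {
      // The access log format string parsed on the config path.
      string format = 1;

      // Data-path inputs, later translated into native Envoy C++ objects
      // (header maps, trailer maps, stream info).
      test.fuzz.Headers request_headers = 2;
      test.fuzz.Headers response_headers = 3;
      test.fuzz.Headers response_trailers = 4;
      test.fuzz.StreamInfo stream_info = 5;
    }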
Other open source projects use various schema definitions; we've standardized on protos here. So let's consider what the inputs for the access logger library are. The most evident input is the access log format string. We want the fuzzing engine to generate random format strings. To indicate this, we add a string field to the protocol buffer message. Now when the fuzzing engine runs, each iteration will generate a random string: for example, maybe a string with a valid access log directive, a string with a malformed directive, strings with special characters, escaped characters, or even a plain old empty string. All of these are valid inputs the access logger should handle.

But that only fuzzes the config path of the access logger. We also want to fuzz the data path. To do so, we need the fuzzing engine to create the request and response information, such as the request headers, response headers, response trailers, and the stream info. We add all these fields to the fuzz input schema. Notice here that these fields are references to other protocol buffer messages under the test.fuzz package. These messages are actually shims or wrappers that we've created that allow the fuzzing engine to generate all this data. Later we'll see how these protocol buffer messages are translated into the native Envoy C++ objects, like the header maps, trailer maps, and stream info. With this step, we have a fuzz schema that captures the full input space of the access logger.

We're ready to write our fuzzer. We make use of the standard libprotobuf-mutator macro to essentially define a callback function. In the background, the fuzzing engine is repeatedly generating a value for the input and passing it to the callback. Our goal is to plumb that input value down to the library under test. So first, we fuzz the config path. We plumb the access log format string from the input down into the parse function of the access logger. The output is a list of formatter objects. Now we're ready to start fuzzing the data path. First, remember that we had those shims in the input schema. We need to translate those shims into the native Envoy objects, like the Envoy stream info object. We have utilities in the Envoy fuzzer libraries to help with this translation. Once we have all that request and response information, we fuzz the data path. We loop over the formatters and call the format method on each one, passing in the request headers, response headers, response trailers, and stream info. Do note here that we're not checking the final output of the format function. Remember, this is a fuzz test. All we're trying to do is generate as many valid inputs as possible and pass them to the library under test. We're not trying to verify the correctness or the functional behavior of the access logger.

With these two steps, we have a working fuzzer, but we haven't done the last step yet, which is adding in the initial corpus. The initial corpus is important for the fuzzing engine to generate realistic values. Let's think about that for a second. In the fuzz input schema, we indicated to the fuzzer that we wanted to generate random access log strings. But we never told the fuzzing engine what a valid access log format directive looks like. So the fuzzing engine will have to do a lot of trial-and-error brute force to find realistic inputs, realistic values. We can optimize this by adding in the initial corpus.
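Before getting to the corpus, here is a hedged sketch of roughly what the fuzz test body described above might look like. The parser class, translation helpers, and header-map types are approximations of the utilities in Envoy's test fuzz libraries rather than exact names and signatures.

    // Rough sketch of the access logger fuzz test (names and signatures approximate).
    #include "test/fuzz/fuzz_runner.h"
    #include "test/fuzz/utility.h"

    namespace Envoy {
    namespace Fuzz {

    // libprotobuf-mutator entry point: the fuzzing engine repeatedly generates a
    // TestCase proto and passes it to this callback.
    DEFINE_PROTO_FUZZER(const test::common::access_log::TestCase& input) {
      std::vector<Formatter::FormatterProviderPtr> formatters;
      try {
        // Config path: parse the fuzzer-generated format string into formatters.
        formatters = AccessLog::AccessLogFormatParser::parse(input.format());
      } catch (const EnvoyException&) {
        // Rejecting an invalid format string is expected behavior, not a bug.
        return;
      }

      // Translate the proto shims into native Envoy objects.
      const auto request_headers =
          fromHeaders<Http::TestRequestHeaderMapImpl>(input.request_headers());
      const auto response_headers =
          fromHeaders<Http::TestResponseHeaderMapImpl>(input.response_headers());
      const auto response_trailers =
          fromHeaders<Http::TestResponseTrailerMapImpl>(input.response_trailers());
      const auto stream_info = fromStreamInfo(input.stream_info());

      // Data path: run every formatter. We only check that nothing crashes or
      // trips a sanitizer; the formatted output itself is not verified.
      for (const auto& formatter : formatters) {
        formatter->format(request_headers, response_headers, response_trailers,
                          stream_info);
      }
    }

    } // namespace Fuzz
    } // namespace Envoy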
The initial corpus is a set of example files that we check into the source code alongside the fuzzer. Each example file is an instantiation of the fuzz input schema in text proto format (I'll sketch these seed files a bit later). In the first example here, I basically show the fuzzer what the response code formatting directive looks like, and I add a valid HTTP response code into the stream info. Similarly, I add another example with the upstream local address directive, and I fill in a valid IPv4 address and a valid port number in the stream info. With these inputs, the fuzzing engine can generate new inputs via mutation, and it will generate new realistic inputs. For example, here is a possible mutation the fuzzing engine might do in creating a new input. It could take the two access log formatting directives and append them together into one string. It could copy over the response code information, but flip a single bit, resulting in an invalid HTTP code. And it might even just completely mutate the address, putting special characters in the address and zeroing out the port value. So this input looks nonsensical to us as developers, but it's still a valid input. The fuzzing engine will pass it to your fuzzer, and your fuzzer should ensure the access logger library can handle it gracefully, without any crashes and without any undefined behavior. With that, I'll pass it back to Adi to discuss the remaining lifecycle of the fuzzer.

Okay, so let's talk about the infrastructure and the lifecycle of the fuzzer. We're using OSS-Fuzz, which is a project by Google that facilitates the execution and management of fuzzing for open source projects. It does continuous fuzzing by repeatedly fetching the latest version of the project source code from GitHub, building its fuzzers, executing them, and tracking any open bugs. Envoy developers just plainly submit pull requests to the GitHub repo and either add new features or update existing ones. On a daily basis, OSS-Fuzz fetches the Envoy proxy repo's main branch and builds all the fuzzers. It then uses another project called ClusterFuzz, which is also provided by Google, to execute all the fuzzers and detect errors. Whenever an error occurs, the bug is added into the Monorail system and is tracked by it. A security engineer can then triage the bug and assess whether it is a vulnerability and how it should be addressed. The fix itself can be done either by the security team or by any other developer.

The development workflow is a continuous process that is composed of three phases. The first is the development of new fuzzers, or updates to an existing fuzzer or its corpus. This is done by contributions to the Envoy proxy repository on GitHub. The second is the continuous fuzzer execution infrastructure, OSS-Fuzz in our case, that takes the code and the corpus and executes the coverage-guided fuzzers. Whenever it finds a new bug, it stores it in a database along with the input that causes it, and notifies us. Finally, the bugs are triaged and the root causes are determined, along with their severity and how they should be fixed. We also add the fuzz test case to the corpus, and it is then used both as a regression test and as a seed for the next fuzzing iterations. I'll now pass it to Teju to talk about some best practices when creating fuzzers.

Okay, so these are best practices that are specific to Envoy, and they're things that Adi and I have learned from our experience writing fuzzers over the past three years.
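Before going into those best practices, here is a rough text proto sketch of the two corpus seed files described in the walkthrough above. The format directives are real Envoy access log command operators, but the stream_info fields follow the illustrative schema sketched earlier, so the exact field layout of the real test.fuzz shim messages may differ.

    # Seed file 1: a valid response-code directive plus a matching response code.
    format: "%RESPONSE_CODE%"
    stream_info {
      response_code { value: 200 }
    }

    # Seed file 2 (a separate file in the corpus directory): an
    # upstream-local-address directive with a valid IPv4 address and port.
    format: "%UPSTREAM_LOCAL_ADDRESS%"
    stream_info {
      upstream_local_address {
        address: "127.0.0.1"
        port: 8080
      }
    }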
One key point I want to stress is that we're not trying to fuzz every single target in Envoy proxy. Adding a new fuzz test has some computational cost and a bit of maintenance burden. Instead, we prioritize what targets to fuzz. We prioritize based on two main attributes: the traffic type and the complexity. We'll break those down.

Traffic type is essentially asking what type of actor your code is exposed to. For example, if your library is completely on the config path, it's probably lower priority to fuzz because we trust our configs: we trust the person deploying Envoy, or we trust the xDS server we're getting our configs from. If your library is on the data path, it's worth further breaking down the trust level by connection. So if you have Envoy deployed as a gateway, perhaps you have untrusted downstream clients. If you have Envoy in a multi-tenant model, perhaps you have untrusted upstream clients. Or if your library makes any other external service calls, do you trust the interaction with that external service? It's really important to consider that, because you want to focus on fuzzing untrusted inputs from malicious actors that are trying to break your deployment. Please reference the Envoy threat model if you're unsure what type of actor your code is rated for.

A good example of this is the gRPC-JSON transcoder. The transcoder is an HTTP filter that has both decoder and encoder paths. If you look at the Envoy threat model, the transcoder is classified as follows: robust to untrusted downstream, but assumes trusted upstream. So with that classification, we assume all the malicious clients are on the downstream side. When we were adding fuzz tests for the transcoder, we focused on just fuzz testing the downstream. We still added fuzz tests for the upstream and encoder code paths, but we really cared about the code coverage on the downstream, or decoder, code paths. That cut our fuzzing work for the transcoder roughly in half.

Going back to prioritization, the other attribute to prioritize by is complexity. This is fairly obvious: we want to fuzz code that's intricate, such as parsers, cryptography, deserializers, any code that has historically been the source of security bugs. Good examples of this are HTTP codecs and route matching.

So now we know what targets to fuzz in Envoy. But before you write your own fuzzer, you should understand what type of target you're fuzzing, because the type of fuzzer might change. First of all, we have utility fuzzers in Envoy. These are fuzzers that fuzz parser-like libraries, such as the access logger walkthrough we just did. These are pretty easy to write, very efficient, very fast, and high-signal. We recommend that 100% of utilities have fuzzers in our repo. The second type is configuration fuzzers. These focus on fuzzing just the config surface, like the server initialization and the xDS inputs. These are also easy to write, but consider the threat model: they might be lower priority because we trust the configs. Things get more interesting with the third category: HTTP data plane fuzzers. These fuzz various portions of the data plane, from HTTP codecs to routing to header processing. They're pretty complex and they come in a lot of different styles. When we were writing some fuzzers in this category, we found that they were not very accessible to the open source community.
We improved this by breaking out a subset of these fuzzers into their own category: the HTTP filter fuzzers. These fuzzers follow a standardized framework to fuzz any HTTP filter. We created a framework to simulate decode and encode calls to the filter under test. We also added helpers to generate headers, trailers, HTTP data, and gRPC data. These are the same helpers you saw in the walkthrough. With this new category, any open source contributor or any filter maintainer can add their own fuzz tests.

Finally, let's talk about some technical trade-offs you'll face when you actually implement your own fuzzers. Those trade-offs are realism, efficiency, and maintenance. We'll walk through these one by one. Realism is essentially how much code coverage your fuzzer has for the library under test. It's a proxy for how well your fuzz input schema matches the true input space of your library. It's important that the input spaces match: if your fuzzing engine isn't generating valid inputs, you probably won't be triggering edge cases in your underlying library. You can fix that by expanding your fuzz input schema. But when you do that, you're making your inputs much more complex. That makes it harder for the fuzzing engine to generate new values, and you lose efficiency. Efficiency is the fuzz rate: how many times a second the fuzzing engine can generate a new value and feed it to your instrumented code. Efficiency is also important. If you don't have high efficiency, the fuzzing engine doesn't have time to explore new input space, new state space. You want to maintain high efficiency. You can usually improve efficiency by adding in some domain-specific knowledge, for example some config validation checks or some short-circuiting in your fuzzer to reject uninteresting inputs early, before they're passed to your instrumented code. But when you start adding in all these checks, you end up with a higher maintenance burden. Now your fuzzer is scattered with tons of tiny checks, you have a lot of code complexity, and if you ever need to change your library under test, you also need to change your fuzzer. You have to make sure the assumptions between the two match. So there's no one solution that fits every single fuzzer. Our best advice is to start by writing a fuzzer that's easy to maintain and efficient, but maybe not so realistic. Let the fuzzer run on OSS-Fuzz for a few days and then measure the code coverage. If you're unhappy with the code coverage, you can look at ways to expand the fuzz input schema while trying to deal with the loss in efficiency by adding in some optimizations.

So how can the community contribute to the fuzzing efforts? The first question we usually get is who can write fuzz tests and help make Envoy more robust and more secure. The answer is basically anyone. The core components of Envoy are typically fuzzed by component-level or library-level fuzzers, as mentioned before. These were mainly written by Envoy contributors with a deep understanding of how these components work. Some of the internal extensions are also fuzzed, either by a dedicated extension fuzzer or by the uber filter fuzzer, which fuzzes the filter interface of any HTTP filter. These fuzzers are also part of the Envoy repository, so you can look them up in the Envoy repo on GitHub. There's also a third type of fuzzer: those that target extensions that are not part of the Envoy proxy repo.
These fuzzers can be executed by the OSS-Fuzz infrastructure if the project that hosts them has integrated with OSS-Fuzz. But how can the community help? Well, we're always looking for contributions that improve the performance of the existing fuzzers, so our fuzzers will have a better signal-to-noise ratio, run more iterations, and cover more code blocks. Maintaining the fuzzing infrastructure is mainly done by some Envoy maintainers and Googlers, and we're looking for other parties that would like to participate in this effort. In addition, there are non-CVE fuzz bugs that are detected, and we're looking for people to assist in addressing them. Finally, we can benefit from more fuzzers that will cover more use cases and more code blocks in Envoy.

Here are a few references to resources and guides that explain how to build and run fuzzers. First, there's the Google fuzzing GitHub repo that contains tutorials and examples on how to write fuzzers, including some tips on improving performance. There's also the OSS-Fuzz documentation, which is relevant to anyone who wants to take their own project and fuzz it. For Envoy-specific fuzzing, we suggest looking at the test/fuzz directory in the Envoy repo and the documentation provided there. If you'd like to see a very simple fuzzer that is not proto-based, the base64 fuzz test is a good starting point. A more interesting fuzzer is the uber filter fuzzer, which is a generic HTTP filter fuzzer. Finally, as mentioned, it is important to always consider the attack surface and the threat model when writing fuzzers.

We would like to thank Harvey Tuch, Asra Ali, and Matt Klein for all their efforts in setting up the Envoy fuzzing libraries and infrastructure, and of course the rest of the Envoy community for their contributions in developing new fuzzers, addressing bugs, and making Envoy more secure and robust. In the future, we're looking into integrating the Envoy CI with ClusterFuzzLite, which executes fuzzers when a PR is submitted and attempts to find bugs earlier in the development process. We're also looking at how to improve the performance of different fuzzers in order to cover more inputs and increase their signal-to-noise ratio. We would also like to add more integration fuzzers and improve their performance. Finally, we're looking at how to improve the fuzzers' code coverage and ensure that they cover code blocks that are of high importance. Thank you.

Thanks. What was the other question? Let me repeat the question: earlier, at the beginning, Matt Klein said that there are a lot of false positives in the fuzzing infrastructure, and you're wondering why, and how we can avoid or fix them. Okay, so this is what we call the signal-to-noise ratio. To address it, we need to understand which fuzzers we're talking about. For example, if we're talking about the config plane, the config fuzzers, the input space is very, very large. It's harder to just take some input and say, okay, let's run this. Also, if you look at the threat model, usually the config plane is considered to be trusted, so we're reducing the effort there. We know which fuzzers are very noisy. There are ways to constrain the inputs so they'll be more realistic; we just need more people to work on this. And then there's also a lot of infrastructure flakiness with OSS-Fuzz.
So anytime we change our build tooling, like Clang or any of the compiler options, we need to make sure everything in OSS-Fuzz works as expected. Sometimes it's timeouts: if we don't set the correct timeouts for the fuzz tests, they can use a lot more resources than we expect, and that will cause flaky bugs in OSS-Fuzz. So infrastructure issues, or just not having the right fuzz input schemas, things like that.

Any other questions? Jason? The question is, do we have any statistics on how many open fuzz bugs there are. So there are currently about 100 open bugs that are not part of the threat model, mainly from the config plane and things like that, just because of the noisy fuzzers that we have. I can say this: we did recently find some bugs that were introduced and could have caused some regrettable errors downstream if we had taken that into production. So whenever you fix these kinds of bugs, you don't get the excitement of "there was an issue and everything crashed, and then I found the fix," right? This is before everything, you know, before your system crashes. So you don't get the excitement of fixing the bug in production, because you just prevented the bug from going into production. That's the idea. As for the numbers, there were over 1,000 bugs found over the entire lifetime of the Envoy proxy fuzzers, across OSS-Fuzz and the integrations, more than 1,000, and at the moment there are about 100 that are still open. Yeah, and just going a little deeper into that: if you looked at our CVE list, if you looked closely at it, you'd notice all the CVEs we listed were from 2019. I assume the reason that happened is because that's when we first started adding in fuzzers and we caught all those bugs in production. Now that we have our fuzzers running, we're not catching any production issues; we're catching these bugs earlier and fixing them before they make it out to production. Well, we'll try.