Hello, everybody. My name is Paul Gazzillo, and I'm an assistant professor of computer science at the University of Central Florida, here in sunny Orlando. I'm excited to be here today to talk to you about some of the work that my students and I have been doing on the Linux build system.

The Linux kernel has tons of configuration options. If you've ever had to configure your kernel, you may have used make menuconfig. When you run the menuconfig tool, you're presented with a vast set of hierarchical menus that let you set everything from which drivers you want to memory management options, scheduling options, USB support, and architecture specifics. This configurability is what enables the same kernel code base to be repurposed for an endless variety of computing devices (refrigerators, cars, supercomputers), all without requiring a user to do any reprogramming of the kernel. The configuration system takes care of specializing the kernel according to the selected configuration options.

But as you can imagine, this configurability brings challenges to maintenance. Most users may just use one distribution, or have some canned configuration files they use. But those who maintain the kernel need to make sure that all of these individual users' configurations work correctly. Now, this is kind of a crazy problem, because there are almost 20,000 configuration options, most of which are basically flags that can be turned on and off. Technically, many of these are tristate values, which allow Linux kernel modules to be compiled, but basically most of these options can be turned on or off. Because there are so many choices, and they can be combined in virtually endless ways, it's as if there are really trillions and trillions of different kernels combined into one code base. This is a challenge for maintainers, who need to account for how patches affect all these different configurations of the kernel, and for testers, who need to pick which configurations to use when they're testing new changes to the kernel.

Testing is even harder because you have to actually recompile the kernel for each configuration that you want to test, and it's not always straightforward to tell which configurations are affected by a given patch to the code base. There's the allyesconfig build option, but it doesn't actually include all of the code in the kernel. And there's randconfig, which can give you arbitrary configurations, but this doesn't really provide any guarantee that you'll hit the code that you're trying to test. And there are actually serious cases of unexpected interactions between configuration options documented in the literature.

Now, many of these options don't actually interact with each other. Many control driver inclusion and may have no combined effects with each other. And indeed, the Linux developers maintain the Kconfig configuration specification, which restricts which combinations of configuration options are even valid for building the kernel. But the specification, which is itself thousands of lines long, does not rule out every problem.
There are still legal combinations that have subtle and difficult-to-track interactions with each other. A researcher, Iago Abal, and his colleagues looked at the kernel patch history for cases of these kinds of interactions causing problems for users. They showed that lurking in these untested combinations of options is everything from null pointer errors and buffer overflows all the way to simple compile-time errors like undeclared functions. What's interesting about these compile-time errors is that they mean there are legal combinations, according to the configuration specification, that no one has even tried to build before.

You might argue: well, who cares? Most configurations will never be used. Most people rely on just some tiny fraction of the options that are possible. But without being able to predict what will be used, users bear the burden of dealing with seemingly valid configurations having build errors or kernel panics that would normally be caught during development.

Even worse, there are even more pernicious examples of bad combinations of configuration options. For instance, Jann Horn, working for Google Project Zero, found a serious security vulnerability, a privilege escalation bug due to the virtual memory area (VMA) cache. What was interesting about this one is that the vulnerability only appeared in certain distributions of the Linux kernel. For instance, the Ubuntu distributions had this vulnerability, while Debian distributions did not, even with the same version of the kernel. This is especially interesting because Ubuntu bases a lot of its package management on Debian. The reason was that this vulnerability depended on a very specific combination of configuration options being set.
There were two in particular. There was CONFIG_PANIC_ON_OOPS, which meant that if some invariant that the developers had written into the code to report an error, some memory corruption error, was violated, the kernel could panic. But system administrators may not actually want the kernel to halt if there's a memory violation, so that they have time to do debugging, or a graceful shutdown, or to save some data in memory. So even Jann points out that this is not necessarily an option that you want to keep on.

The other option was SECURITY_DMESG_RESTRICT. dmesg is a log of kernel messages, and this option can restrict user space from accessing this log. It turns out that the vulnerability relies on looking at that log in order to find pointer addresses. This VMA cache vulnerability is due to a use-after-free, or dangling pointer, and it has a very complicated series of steps. One of those steps depends on finding the address of a particular structure in memory, and dmesg was the way that the exploit found that address.

So finding such vulnerabilities is hard enough without having to consider all the possible configurations that the kernel has. I even speculate that this kind of configuration-related vulnerability might be a vector for adding backdoors to open source software. One of the benefits of open source software is that you have a lot of eyes: a lot of people looking at the software, trying it out, and making sure it's not broken or vulnerable. The code is open, so if somebody tries to add some obvious backdoor, you would hope that with enough people looking at it, it will get caught. But with these kinds of subtle interactions between configuration options, even in distinct subsystems or distinct components of a piece of software, some very devoted actor may be able to add multiple innocuous changes to seemingly unrelated parts of the kernel, or some other piece of software, that only produce a vulnerability when used together. The attacker then just needs to get security analysts to use a different configuration from the targets of their attacks. Now, this is just wild speculation. I don't know of any cases where this has actually been used, but it certainly could be, theoretically.

I'm not going to pretend to have a solution to all of these various challenges, but I think there are a few good first steps on the path to tackling the maintenance challenges that configurability brings. A kernel developer, Julia Lawall, who is also a software engineering researcher, posed to me a particular maintenance problem that kernel maintainers have: when a maintainer gets a patch, they need a configuration file to test it. It seems obvious, right? The question is, can we do this automatically? Even if a submitter creates a patch and submits some configuration file, if they submit one at all, there's no guarantee that that configuration actually exercises all of the patched code, and certainly not all of the various combinations of configurations that the patched code might touch.

It turns out that tackling this problem is actually a really good first step toward tackling a whole bunch of configuration-related challenges in maintenance and testing. For instance, we have this question: given a patch, what configurations does it affect? But also: given a bug, say found by a fuzzer or a test suite, what configurations does it appear in? Which configuration interactions are required to exercise that bug? Another question might be: what is a minimal configuration that includes some specific source code?
And perhaps even dead code: what code is no longer configurable in the kernel at all, so that we may want to remove it?

There are actually tools that tackle some of these specific problems. Julia Lawall has a tool, JMake, which tries to find whether a patch is covered by allyesconfig, and it can help find some configurations that will test some patches. config-bisect is a tool that's included in the kernel and that, using a testing approach, can help find some configuration that exercises a bug. And Undertaker, by Tartler et al., seeks to find dead code in the kernel. Now, all of these, at least in part, use a search-and-test approach, where they generate configurations, build or preprocess them, and check whether the configuration they've tried is the one they want for the problem they're trying to solve. So for patches, we could just generate random configurations until we hit the lines of code that the patch touches.

But my interest is in static program analysis, because to me static analysis is almost magical. Static analysis techniques, when they work right, can just take source code as input and figure out what that software does without ever having to run it. Even better, it can be sound, it can be comprehensive: it can find the behavior of the program for all possible inputs. So for this case, one possible option is to use a static analysis approach. One of the reasons for that is that I just enjoy these kinds of approaches, so I think this is fun to do. But another potential benefit is that static analysis approaches could be fast. Instead of these test-and-search kinds of approaches, a static analysis, if done efficiently, could be fast. And these configuration spaces are gigantic.
There are trillions and trillions of possible configurations. If we can answer these kinds of questions without having to search through the space of configurations, that might give us the kind of performance that would allow these tasks to be really efficient. We can also perhaps be more comprehensive, or sound, as theoretical people like to say. In a sense, a sound analysis is able to answer the question for all possible inputs to a program or, in this case, all possible inputs to a build system. So for a patch, instead of finding one or two configurations through a test-and-search approach, we might be able to say: here are all the possible configurations that touch this patch. Maybe we can't test all of them, but at least we can make some decision about which ones to try first. So if we can pull off this static analysis approach, we may be able to make automated tools that maintainers can use regularly and that provide some good, strong guarantees.

It turns out that all of the problems I've brought up so far, finding a configuration that matches a patch, finding dead code, have one thing in common. All of them try to take some source code in some particular configuration of the kernel and map it back to the configurations that actually include that code. I'm in academia, so we've got to have a nice, long, multisyllable name for this. I call this problem the configuration localization problem.
That's ten syllables. I've got half a paper written just by using that term. But in all seriousness, my thinking is that if we can solve this problem fully automatically and comprehensively, then we'll be able to build tools, like finding configurations for patches, much more easily, and make them automatic. This is really the crux of the work I'm doing in this area: if we can automate configuration localization, then we can build automated tools for many of these problems.

Okay, so let's take a look at how the Kbuild system, the Linux build system, works, and what configuration localization really means in terms of the kernel build system. From a bird's-eye view, the Linux configuration system is basically a big program that takes in your .config file, which has all of the settings you want for your particular kernel, and produces your kernel binary. But if we open up the hood a little bit, we can see that there are three distinct phases to this build and configuration system. The Kconfig tool takes in that .config file and enforces the configuration constraints that developers have defined in the Kconfig files. Then there are the Kbuild Makefiles, which take the .config settings, as long as they've been validated by Kconfig, and use them to decide which C files should be compiled and linked into the final kernel binary.
There are 20,000 source files in the kernel, and not all of them are going to be compiled and linked into the final kernel for every configuration. In fact, only some smaller subset of them is going to be compiled. Kbuild is the tool that makes that decision, and it's basically based on make.

And then there's the C compiler. The first step of the C compiler is the C preprocessor, and the C preprocessor also takes these configuration option settings, as macros, and uses them to decide which individual lines of source code in each C file to compile. This is called conditional compilation in the compiler world. Finally, once we have decided which C files to compile, and which lines of those files, the C preprocessor passes these on to the compiler, and the Makefile calls the linker to produce our kernel binary.

If we pull out the part of the build system that does the configuration step, the part that actually chooses which source code to build based on the .config settings, then we can view this build system as a kind of code generation using metaprogramming. The C preprocessor and Makefiles are, in a sense, a kind of metaprogramming that takes the config files as input and produces code as output. Once we have the output of the C preprocessor, building the kernel is basically just compiling and linking it. There are really no more compile-time configuration decisions to be made. Configuration options do also affect the runtime behavior of the kernel, but for my purposes I'm just concerned with compile-time configuration. So once we get through the C preprocessor, we're pretty much done with compile-time configuration.
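To make the three phases concrete, here is a minimal Python sketch that treats the configuration part of the build as a function from a set of enabled options to the source lines that get compiled. It is a toy model under stated assumptions, not the real Kbuild logic: the tables, the function names `is_valid` and `build`, and the C statements inside the files are all made up for illustration, and tristate (module) values are ignored.

```python
# Toy model of the three configuration phases. All data here is hypothetical.

# Phase 1, Kconfig: dependencies that each enabled option must satisfy.
DEPENDS_ON = {"UFS_FS": ["BLOCK"], "UFS_DEBUG": ["UFS_FS"]}

# Phase 2, Kbuild: the option guarding each C file (None means always built).
FILE_GUARD = {"fs/ufs/balloc.c": "UFS_FS", "init/main.c": None}

# Phase 3, preprocessor: each line of a file with its #ifdef guard, if any.
FILE_LINES = {
    "fs/ufs/balloc.c": [
        ("alloc_block();", None),
        ("ufs_debug_print();", "UFS_DEBUG"),
    ],
    "init/main.c": [("start_kernel();", None)],
}


def is_valid(config):
    """Kconfig phase: every enabled option has all its dependencies enabled."""
    return all(dep in config for opt in config for dep in DEPENDS_ON.get(opt, []))


def build(config):
    """Map a set of enabled options to the (file, line) pairs that get compiled."""
    if not is_valid(config):
        raise ValueError("configuration violates Kconfig dependencies")
    selected = []
    for path, guard in FILE_GUARD.items():         # Kbuild phase: pick files
        if guard is not None and guard not in config:
            continue
        for line, line_guard in FILE_LINES[path]:  # preprocessor phase: pick lines
            if line_guard is None or line_guard in config:
                selected.append((path, line))
    return selected


if __name__ == "__main__":
    print(build({"BLOCK", "UFS_FS"}))
```

Running `build` in this forward direction is cheap. The questions discussed in this talk go the other way: given one of the `(file, line)` pairs, which configurations produce it?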
And so my conception of the build system is just these three steps. If we want to ask a question like configuration localization, that is, which source code maps back to which configurations, then we can view it as the inverse of this build and configuration step. Once we have some source code locations that we're curious about, either because we were given them in a patch or because a fuzzer gave us a bug report, configuration localization is really just inverting this build and configuration process: from source code locations, the output of the build system, tell us how we got there, that is, which .config files and which configuration options caused this source code to be built and compiled.

This is not necessarily so straightforward, for a number of reasons. For one, it's kind of analogous to a hash code. It's very easy to compute the hash in the forward direction, but trying to invert the hash is really hard. Now, hashes are also a giant state space problem, probably even larger than this problem, but it's the same spirit: trying to reverse the effects of the build system is really expensive if we have to search through all the possible .config settings. And what makes this tricky is that each one of these phases of the build process really has its own language, and its own behavior, in how it goes from .config settings to a selection of which kernel variation you want.

So this is an example of the C preprocessor doing conditional compilation. In this example, the highlighted #ifdef and #endif are actually not C language per se.
This is the C preprocessor language, and this code snippet means: only include the enclosed source code if the UFS debug option is enabled by the person building the kernel. So this is an example of conditional compilation. The code within this #ifdef and #endif block only appears in kernels where the CONFIG_UFS_DEBUG option is enabled. If we want to figure out how we got there, we need to understand how the C preprocessor chooses source code based on configuration options.

Kbuild Makefiles do conditional compilation of entire C files. In this case, balloc.o and cylinder.o are the object file names of these source files. First of all, they're linked into this ufs.o file. But all of these files are only compiled and linked if the UFS_FS configuration option is enabled, that is, if you have the Unix file system (UFS) turned on in your configuration, whether you've turned that on in menuconfig or however you're configuring the kernel. This syntax is a little weird if you haven't seen it before, and I'll talk about it a little bit later, but this is how, in Kbuild, developers specify that certain files should only be compiled and linked under certain configuration options.

And lastly, the Kconfig system restricts when these options are allowed to be enabled. It encodes dependencies between the options. For instance, in this case the UFS_DEBUG option can only be enabled if the UFS_FS option is turned on, and, even more, the UFS_FS option can only be turned on if the BLOCK option is turned on. There are thousands of lines of these dependencies described in Kconfig, and this provides a real challenge to figuring out not only which configuration options lead to which code, but which configurations are actually viable according to these constraints.

So the main crux of the solution that I'm approaching, the static analysis approach, is to first take each of these contributors to the build and configuration encoding in this build system and see how it contributes to the buildability of the software. They each make their own contribution, but for configuration localization all we really care about is buildability. We don't really care how the linking works, and we don't care so much about the C code itself. All we care about is how these three phases influence the buildability of given source code. If we can reduce each of these phases to a Boolean formula that says true when the source is buildable, and false when it isn't, under some configuration option settings, then that captures exactly what we need to do configuration localization. So imagine that we have these giant Boolean formulas for each phase that take in every configuration option and evaluate to true when a piece of source code is buildable under that configuration, and to false when it is not.

If we can do this, then configuration localization is really just generating these constraints as Boolean formulas and then doing satisfiability. This problem of finding the solutions to a Boolean formula is the classic Boolean satisfiability problem, classic in computer science for decades. It turns out that there's a whole lot of tooling that has been developed over the past couple of decades for finding these kinds of solutions really fast: SAT solvers and SMT solvers. So this configuration localization problem, which seemed like a giant search space problem, is now reduced to extracting constraints from the build system for given source code and then using an off-the-shelf SAT solver to find whether they are solvable, and using the solutions to generate .config files that should build that code when passed into the normal Linux build system.

The main trick in really all static analyses is that they work by following both sides of all conditional branches. I've taken this kind of approach and applied it to each of these phases of the build system. Each of these tools has its own encoding of the build process, and each of them involves some kind of conditionals. If we can follow both sides of the conditionals and preserve the path conditions for all sides of all conditionals, then we can preserve this configuration information about each piece of source code that gets generated by the build system.

There are three main tools that we work on, one for each of these three phases of the build process. For Kconfig, we have the kclause tool, which can generate Boolean constraints from the Kconfig language. We have kmax, which does static analysis of Makefiles to get Boolean constraints for each of the C files that they build. And we have SuperC, which can get the preprocessor constraints. I'm just going to go over each of these very quickly.

Working backwards from the source file to the .config file, SuperC does configuration-preserving C preprocessing and parsing. SuperC was actually the very first research project I worked on when I was in graduate school. It's from more than eight years ago at this point, and it was actually built for a different purpose: the problem of parsing all configurations of a C source file. But it turns out that in order to do the parsing we need to first preprocess the unpreprocessed source, and in the process of doing that, SuperC collects symbolic Boolean constraints from the preprocessor. I haven't quite integrated this into the configuration localization work yet, but I predict that we can use it to help localize individual lines of source code and get patches localized to the particular .config files that build them.

In short, SuperC does macro expansion and header inclusion like a regular preprocessor, but it leaves preprocessor conditionals in place. It turns out that this has some subtle interactions with the rest of the preprocessor. For instance, macros can be multiply defined, with different definitions in different configurations, under different #ifdefs. The preprocessor then just collects all of this preprocessor condition information as symbolic Boolean formulas. If you're interested, there's a paper on this, and there's a website where you can download and try out the tool.

The second tool is kmax. kmax can look at Kbuild-style Makefiles and collect the constraints that the Kbuild Makefile says each source file should be built under. It works by doing a static analysis of the make language itself, for the subset that Kbuild typically uses, and it is able to find every possible path in Makefiles and preserve those paths' conditions as Boolean formulas. Now, kmax does have some bugs.
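The idea of following both sides of every conditional and keeping the path condition can be sketched in a few lines of Python. This is only an illustration of the general technique, not how kmax or SuperC are implemented: the nested-tuple representation, the names `TREE`, `path_conditions`, and `show`, and the object file names are all hypothetical. As a simplification, a file listed under several branches would keep only its last condition here, where a real tool would combine the conditions with a disjunction.

```python
# Hypothetical mini-AST for a Makefile-like input: a block is a list of items.
# An item is an object-file name (str) or ("if", option, then_block, else_block).
TREE = [
    "core.o",
    ("if", "CONFIG_A", ["a.o", ("if", "CONFIG_B", ["ab.o"], ["a_only.o"])], []),
]


def path_conditions(block, condition=()):
    """Follow both sides of every conditional, recording each file's path condition.

    A condition is a tuple of (option, required_value) pairs, read as a conjunction.
    """
    results = {}
    for item in block:
        if isinstance(item, str):
            results[item] = condition
        else:
            _, option, then_block, else_block = item
            # Explore the branch where the option is on, then the one where it is off.
            results.update(path_conditions(then_block, condition + ((option, True),)))
            results.update(path_conditions(else_block, condition + ((option, False),)))
    return results


def show(condition):
    """Render a path condition as a readable Boolean formula."""
    if not condition:
        return "true"
    return " && ".join(opt if val else "!" + opt for opt, val in condition)


CONDITIONS = path_conditions(TREE)

if __name__ == "__main__":
    for name, cond in CONDITIONS.items():
        print(name, "is built when", show(cond))
```

Because no branch is ever discarded, one pass over the input yields a build condition for every file at once, instead of one answer per concrete configuration.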
kmax only supports a subset of Kbuild Makefile syntax. Kbuild Makefiles usually use a well-defined subset of make, but in some cases you'll see Kbuild files that use a make target to build a file instead of putting it into the list of objects that get built.

Now, Kconfig files are already a kind of constraint specification language, but they're not necessarily written in pure Boolean logic. kclause takes Kconfig constructs and interprets their meaning in terms of Boolean logic. Kconfig will specify constraints as Boolean formulas, but it will also use constructs like depends on and select to say how options depend on each other, and kclause turns them into Boolean logic. For instance, if configuration option A can't be turned on unless B is turned on, then A being turned on implies that B must have been turned on, so we can turn that into A implies B. There's a bunch of subtlety around reverse dependencies (there are some bad practices for reverse dependencies that are still allowed) and around choices, so we've had to actually open up the Kconfig tool source and see how it interprets these different constructs.

This is still work in progress that I'm working on with my two new graduate students this year, actually my first graduate students, Jeho Oh and Necip Yildiran. This tool is largely done. We haven't written a paper on it yet.
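As a rough illustration of reading "depends on" as an implication and then treating configuration localization as satisfiability, here is a small Python sketch. It is a simplification under stated assumptions and not kclause's actual output or API: options are plain Booleans (no tristate, select, or choice), brute-force enumeration stands in for a real SAT solver, and the names `kconfig_ok`, `localize`, and `to_dot_config` are hypothetical. The three options and their dependencies are the UFS example from earlier in the talk.

```python
from itertools import product

OPTIONS = ["BLOCK", "UFS_FS", "UFS_DEBUG"]

# Each pair (A, B) encodes "A depends on B", read as the implication A implies B.
DEPENDS_ON = [("UFS_DEBUG", "UFS_FS"), ("UFS_FS", "BLOCK")]


def kconfig_ok(cfg):
    """Kconfig constraints: every dependency holds as an implication."""
    return all((not cfg[a]) or cfg[b] for a, b in DEPENDS_ON)


def localize(target_condition):
    """Return every assignment that satisfies Kconfig and the target's build condition.

    Enumerating all assignments only works for a toy option set; a real tool
    would hand the combined formula to a SAT solver instead.
    """
    solutions = []
    for values in product([False, True], repeat=len(OPTIONS)):
        cfg = dict(zip(OPTIONS, values))
        if kconfig_ok(cfg) and target_condition(cfg):
            solutions.append(cfg)
    return solutions


def to_dot_config(cfg):
    """Render an assignment using .config syntax."""
    return "\n".join(
        f"CONFIG_{name}=y" if on else f"# CONFIG_{name} is not set"
        for name, on in cfg.items()
    )


# Target: code under #ifdef CONFIG_UFS_DEBUG inside a file built only with UFS_FS.
DEBUG_CODE = localize(lambda cfg: cfg["UFS_FS"] and cfg["UFS_DEBUG"])

if __name__ == "__main__":
    print(to_dot_config(DEBUG_CODE[0]))
```

An empty result from `localize` corresponds to dead code in this model: no configuration that Kconfig allows can ever build the target.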
We're working on that paper now. But that tool is part of the same repository as kmax.

And so now I'd like to give you a demo of some of the tooling, which right now combines the kclause work for Kconfig and the kmax work for the Kbuild Makefiles. What we can do with klocalizer is give it the name of a particular source file in the source tree. We can use the .o or the .c extension, and klocalizer will convert between them. What it does is compute the Kbuild constraints on the fly by running the kmax tool. The kclause formulas take a little bit longer to compute, so it uses cached versions of these. If I had to generate them on the fly, as I'm doing now in another window, it would take about two to three minutes. That's maybe not so bad, and it only has to be done once per Linux version. The problem is that there are about 20 architectures, and each one has its own architecture-specific Kconfig files, so the formulas have to be generated for each architecture. If we want to generate them for all architectures, it takes about an hour. But it only has to be done once per Linux version, not once per run of the tool. So this takes a little while, and we're working on making it more efficient. Once we have that, we're able to just use the cached version of those kclause constraints, and that's what we see here.
We see those cached versions getting used, and it takes just a few seconds to check whether it's satisfiable or not to build this compilation unit under the x86 architecture. And so I can compile this. I first run olddefconfig, because we haven't supported emitting default values for the configuration options yet, so I use olddefconfig to get the default options. Then, when I go to build this compilation unit, we can see here at the end of the build that the file gets built.

Now, it turns out that this file is also buildable if we do make allyesconfig. But not all compilation units are available under the allyesconfig configuration. For instance, this squashfs decompressor_multi is a mutually exclusive choice among the decompressor options in that particular file system. After doing make allyesconfig, if I try to build this compilation unit, it takes a second, and then Kbuild complains to me that there's no rule to build this file. So I can use klocalizer to figure out the constraints under which this file gets built. It tells me here that, yes, it found constraints that are satisfiable, and then it generates an arbitrary configuration that is able to build this compilation unit. Now, if I run olddefconfig to set the default values and go to build this file, because klocalizer generates configurations that satisfy the constraints, we should be able to get it built. And we see here that decompressor_multi can now be built.

So what about other architectures? Other architectures are trickier.
For instance, if I try to build this PS3 disk driver, ps3disk, even with allyesconfig, it's going to complain to me that there's no rule to make this driver. What klocalizer can do is something really simplistic: it looks at each architecture, one at a time, and checks whether it can find some architecture in which the build constraints are satisfiable. We can see here that for x86 it says the constraints are unsatisfiable, so then it just tries the architectures one at a time in a predefined order. The user can specify which architectures they want to try: they can try just one, they can try all of them, or they can change the order. But right now klocalizer does something very simplistic. It just tries each architecture one at a time, looking for some architecture in which that compilation unit is compilable. And here it found that the constraints for powerpc mesh with the constraints for building ps3disk.

Now, because I need cross compilation to actually build this compilation unit, I use the make.cross tool, a super useful tool that automatically downloads any cross-compilation tools you need, and klocalizer will just spit out the instructions you'd need to cross compile using make.cross. Now if I go and try to build ps3disk, hopefully this should actually compile for me. It's building, and indeed we got ps3disk.

Now, klocalizer is still work in progress. It certainly breaks sometimes. The kclause formulas are not yet perfect. I'm giving you examples where things work really well, but of course there are still bugs in this.

So that's my talk. Thank you very much for watching.
I hope you got something out of this, and I hope this might be useful to someone. I am in academia, but I really hope that some of these tools will actually have some benefit in industry. That's why I'm so excited to be here talking to the Linux developer community. I have a website, configtools.org (that name was apparently still available), that has links to my website and some of these tools, where you can find more information and download and try out the tools yourself. I welcome lots of feedback. These are still research prototypes, so please give me feedback, make GitHub issues, and hopefully we can make these tools useful for maintainers. Thank you, everybody.