Welcome back everyone, this is day two of the Linux Security Summit. We had a very full room here yesterday; for those that are remote, we had to add chairs at the back, so it's really good to see the increased attendance. Today the program will be similar in format to yesterday: we have BoF sessions at 4:10 p.m., and I'll once again ask speakers to repeat questions rather than passing a microphone around. The first talk, which we'll go straight into, is a remote talk, and it's about Code Aware Services in the service of vulnerability detection.

My name is Bartosz Zator, I'm from the Samsung Mobile Security Team, and today I'd like to tell you about various tools that can improve the security assessment process of a large-scale software system.

A few words about myself: I spent the last 15 years in various jobs in mobile product development, where I found out that I'm really passionate about creating tools that improve the productivity of software engineers and the software development process in general. So a few years ago I moved to a security team, where I took on the task of making the life of security engineers easier, and today I'd like to tell you about some results from this challenge. More specifically, I will tell you about the Code Aware Services system, in short CAS.

CAS is a system that provides insight into how a software product is made. It can also be used to automate source-code-related operations. It's composed of two parts, each part creating a database of useful information. The first part is the Build Awareness Service, and it creates a database of build information. The second part is called the function type database, and it creates a database of code information extracted from the original source files. What I'd like to show you today is how this can be useful in a software engineering job, especially for the automation of vulnerability detection.

Okay, so here we have a very short glimpse of the entire system; I will talk about each part of it in detail during the talk. Let's now focus on the first part of the system, the Build Awareness Service, BAS in short.

So what kind of problem are we facing here? Imagine you need to do a security assessment of a mobile phone. The company you work for actually produces this phone, so you have access to all the source code that you need. So we download the source code from the repository, and the first surprise arises at this point, because the entire software stack that runs on today's smartphones is extremely large. For example, for the latest AOSP x86 version, when you download the sources from the repo, you end up with around 1 million files. Of course you want to focus on the low-level native part of the operating system, mainly the C and C++ files, because a large number of issues originate in memory corruption problems, and C and C++ really "help" you in this regard. But that still leaves more than 300,000 files to review, so the complexity is really huge. We have a few thousand modules created during the build, I mean libraries and executables. Another problem is that you can build many distinct products from one source tree, or different variants of one product, so actually only a subset of all the source files is used to build a specific configuration. And the final complication is that you have preprocessor definitions, and inside the source files you have conditional inclusion of sources based on these preprocessor definitions.
That means that different parts of one source file can be used by two different build variants. So how could you know if the code you're looking at is even running, right? We want to avoid the worst nightmare of a security engineer; I call it the "almost CVE". You spend the entire day reviewing some function, you think you've found an error in the code, so you create the PoC, push it to the device, run it, only to find out that your code has not even been compiled.

Okay, so how could you know which parts of the source code are actually used in a given configuration? Well, the build process is able to create the final image that you can flash, you can run, and it even works, right? So all the information that you need is actually embedded in the build process itself. What kind of information is that? Which source files are compiled, exactly how they are compiled, what the dependencies between build files are, information about custom tools executed during the build, for example tools that generate source files, etc. So actually all we need to do is trace the build and grab all the information that we want.

Okay, so now there are two questions: what do we need to trace, and how could we do that? What do we need to trace? We probably need to trace all the files opened during the build, so we need to trace the open family of syscalls. We also need to trace all the executed processes, which are the exec syscalls, and it also helps to track the pipe syscalls, to catch which processes exchange information with each other using pipes.

And how could we implement the tracer? There is the LD_PRELOAD trick, but there are two problems with it: we cannot track the statically linked executables used during the build, so those parts would not end up in the trace, and the LD_PRELOAD trick can also be used by the build system itself, so the two would clash. Okay, so another idea: ptrace. And this actually works; the first version of our tracer was based on a modified strace tool. There was noticeable overhead, a build which normally took three hours could end up taking around eight hours, but we could live with that, and we did live with that for quite a long time, until we bought these big servers and quickly found out that the more cores you use for a build under strace, the slower the build gets. For 120-core machines the build hardly even completed. I suspect there is some synchronization of all syscalls from all processes inside the ptrace infrastructure; I'm sure some of you can point out to me the exact reason why this happens. Anyway, we finally switched to the Linux kernel tracing infrastructure, ftrace. The tracer was implemented as a Linux kernel module, so it tracks the required information inside the kernel and just writes the data to the ftrace buffers. The overhead depends on what exactly the build system does, but on average, for the full AOSP build, we had around five percent overhead in build time.

So here's the BAS architecture; it's quite simple and straightforward. The tracer traces the build and saves raw syscall information to a file, which is then post-processed and saved in a JSON file that can be easily accessed by applications. Of course it can also be stored in a proper database if you like, and applications can use it for any purpose they want. I just want to mention two more things about the architecture. The first is that some of the executed processes have special meaning, for example compilation commands. These are analyzed further to extract additional information, like the preprocessor definitions defined on the command line, internal compiler preprocessor definitions, include paths used by the compiler, etc. The second is the functionality of computing file dependencies between the build files. For example, for the vmlinux linked kernel executable, we can get the list of files that vmlinux depends on, which is actually the list of source files and header files that were used to create the vmlinux executable.
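To make the post-processing step concrete, here is a minimal sketch of turning raw trace events into a queryable JSON build database. The one-event-per-line format, the field names, and the file names here are invented for illustration; the real CAS tracer output and database schema differ.

```python
#!/usr/bin/env python3
"""Minimal sketch of the BAS post-processing step, assuming a hypothetical
raw trace format of one event per line: "<pid>|<syscall>|<argument>".
The real CAS tracer and its file formats differ."""
import json
from collections import defaultdict

def postprocess(raw_trace_path, out_path):
    # Group the traced events by process: which binary each pid executed
    # and which files it opened during the build.
    procs = defaultdict(lambda: {"exec": None, "opens": []})
    with open(raw_trace_path) as f:
        for line in f:
            pid, syscall, arg = line.rstrip("\n").split("|", 2)
            if syscall == "execve":
                procs[pid]["exec"] = arg          # executed binary + args
            elif syscall in ("open", "openat"):
                procs[pid]["opens"].append(arg)   # file touched by the build
    with open(out_path, "w") as f:
        json.dump(procs, f, indent=2)

postprocess("raw_trace.log", "build_db.json")
```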
Okay, so let's see some examples of how this could help us in everyday work. The first thing is that you can radically improve your code search utilities. How? Well, you can get the exact subset of files that were used to create the final product, or a specific module like a library or executable. Imagine you have a checkbox in your code search tool: you can of course search the code normally, as always, but when you mark this checkbox it means look only in the files that were used to build the product, or look only in the files that are dependencies of some specific module. For example, in the latest AOSP, as I mentioned before, you download 1 million files, but fewer than 200,000 were actually used to build the product. It's similar for the common kernel: you download around 80,000 files, and less than 25% of them are actually used to create vmlinux. You can see an example of that here: searching for the "GPU" string in a code search tool across the entire AOSP, there is around a 10x reduction in the number of results when you mark this checkbox.

Another example would be improved IDE indexing. Why would we want an IDE, right? But it has really nice features, like searching for symbol references, expanding difficult macros, navigating to the source code easily, etc. Normally, even if you take all the relevant files and just pull them into the IDE, you still get thousands of errors; everything is marked in red. Why does this happen? Well, in order for the IDE to operate properly, it needs the exact compilation switches of all the compilation commands executed during the build, so that it can pick up all the preprocessor definitions that were used. Here is an example of indexing vmlinux in the Eclipse CDT IDE. On the left side we've just pulled all the source code of the Linux kernel into the IDE; on the right side we have custom-generated project description files for Eclipse, created using the build information. On the left we've indexed a lot more sources than are really necessary, and we also have many more unresolved symbols than on the right; the right side is almost perfect in this regard. So we see a radical improvement in the IDE indexing capabilities, which translates into a much better experience when using the IDE.

Okay, next example. Imagine we have a very complicated build system that takes a few minutes just to read all the makefiles, even before the proper build starts. The Android build system was like that in the past; it's significantly better now, but I'm sure you can still find these kinds of build systems in the wild. Say you want to make a partial, incremental build of some selected functionality you're currently working on, say a few source files from one module. With the build information you can easily generate Makefiles or Ninja build files, knowing all the specific compilation commands that were used to build this module or module hierarchy.
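Here is a minimal sketch of what that generation could look like, assuming the hypothetical build_db.json from before stores, per compiled file, the exact command line that was used, and assuming the recorded command produces the matching object file; the real CAS schema and tooling differ.

```python
#!/usr/bin/env python3
"""Sketch: generate a Makefile for a partial build out of recorded
compilation commands (hypothetical build_db.json layout)."""
import json

with open("build_db.json") as f:
    comps = json.load(f)["compilations"]

MODULE_FILES = {"fs/open.c", "fs/read_write.c"}  # files we want to rebuild

objs, rules = [], []
for c in comps:
    if c["file"] in MODULE_FILES:
        obj = c["file"].rsplit(".", 1)[0] + ".o"
        objs.append(obj)
        # Replay the exact command the real build used for this file.
        rules.append(f'{obj}: {c["file"]}\n\t{c["command"]}\n')

with open("Makefile.partial", "w") as mk:
    mk.write("all: " + " ".join(objs) + "\n\n" + "\n".join(rules))
```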
But let us take a look at a more concrete example of that sort. In order to perform Clang static analysis of some module, you have to invoke a build of that module under the scan-build tool; it then captures all the compiled files and performs static analysis on all of them. The problem arises when you want to run a selective static analysis on certain files only, say a few modules from a large build. With the build information you can generate a proper makefile with compile commands just for these files, as in the presented example, which allows fine-grained Clang static analysis on a large software tree. Such a generated partial Makefile can, for example, be driven directly by scan-build, as in `scan-build make -f Makefile.partial`.

And finally, you get a nice service that allows you to query the build information according to your needs. For example, you're working on some issue in a specific version of a product, and your infrastructure has build information for every released software version, so you can consult it whenever necessary. Let's take a look at some example queries you could ask this service: getting a list of compiled files which use a specific header file; getting a list of modules that depend on some specific source file; getting a list of preprocessor definitions, either given on the command line or internal compiler preprocessor definitions, which were used during a given compilation. Another benefit is that you can write automation scripts in your infrastructure that query the build information from the service and use it in some productive manner.
So let's now focus on the second part of the CAS system, called the function type database, the part which creates a database of source information extracted from the original source files. Okay, so now we have build information and a perfectly indexed IDE, so we can finally start our main job, which is the security assessment of the mobile phone.

Let's think for a while about what a security engineer does when performing a security assessment of the Linux kernel, or any other module for that matter. The first thing to do is to locate the entry points. What are the entry points? An entry point is a location in the source code where data from the user flows in; it might be some syscall interface, it might be network layer sockets, or other hardware. The second thing to do is to verify the source that implements getting the data and further processing it, because there might be errors in there, and the security engineer needs to find them. How does the security engineer look for these errors? Well, he relies on his experience and tries to find vulnerable patterns in the code, like for example buffer overflows. But the problem is still the same: there's a lot of source code to be reviewed. For the Linux kernel with all the accompanying modules it can easily be more than 3,000 source files. So what you want to do is employ some kind of automation to help with this task. One way to do it is to take the code and adapt it for fuzzing, so it can automatically find crashes, which can reveal security problems. But how do you automate the security code review itself? There's this idea: what if everyone who works with the code, and has the willingness and capability to do so, could write their own tools that operate on source code, relatively easily, in any language they want? They could then transform their own internal vulnerable-pattern recognition into a more automated form that would also work at scale. So what would we need to achieve that?

Well, it would be best if we had some parsed source code representation in a simple format that we could just explore easily. So how can we extract features from the source code? Well, maybe you can write some regular expressions, but you can only get so far with simple regular expressions before the task becomes unmanageable. Maybe you could write your own parser? Yeah, but to write the parser, the first thing to do is to write the preprocessor, because we don't want to parse the original code but the preprocessed code, and that is a difficult task in itself. And even if we achieve that, we have to face the very complicated grammar of C and C++. So the reality is that you cannot write your own parser. But hey, there are some working compilers out there, right? So why don't we just use the parser from one of them? As it turns out, there is something we can use straight away: the Clang frontend for LLVM. The beautiful thing about Clang is that it is a collection of libraries for doing various things, and it was designed that way from the beginning: the clang compiler is just a driver binary that combines functionality from many different libraries and implements the parser and the compiler. There is one interesting library, libclangFrontend, which is an entry point to the parser implementation, and everyone can use it for their own purposes; it's possible to write a small application that uses this library to parse source code. This library implements the ASTUnit class, which has a function called, surprisingly, Parse, that takes the source file and transforms it into a parsed representation of the source.

So what does this parsed representation look like? It's a tree called the abstract syntax tree; in other words, it's just a code representation in tree-like form, a tree of nodes. It's an equivalent representation, meaning the source code can be generated back from the AST form, and each node in the tree is described by a C++ class that implements some specific element of the source grammar. For example, there is one node class, DeclRefExpr, which represents a reference to a variable in a C-like language. As I mentioned, it's a tree, so it's easy to walk; of course, you need to write the C++ code that will traverse it. For example, if you want to find all usages of some variable x, you just need to traverse the tree, find all the nodes of type DeclRefExpr, and compare the name: if the name of the node is x, you've got it, you've found a usage of the variable x. Okay, so here we have some very simple code and its equivalent representation as an abstract syntax tree.
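As an aside, the same AST walk is easy to try from Python via the libclang bindings rather than C++ against libclangFrontend. This minimal sketch (my illustration, not the CAS code; the file name and flags are placeholders) finds all uses of a variable x exactly as described above.

```python
#!/usr/bin/env python3
"""Sketch: find all DeclRefExpr nodes naming 'x' using libclang bindings."""
import clang.cindex as ci

def find_uses(node, name, hits):
    # A DeclRefExpr cursor is a reference to a declared entity, e.g. a variable.
    if node.kind == ci.CursorKind.DECL_REF_EXPR and node.spelling == name:
        loc = node.location
        hits.append((loc.file.name if loc.file else "?", loc.line, loc.column))
    for child in node.get_children():
        find_uses(child, name, hits)

index = ci.Index.create()
tu = index.parse("example.c", args=["-std=c11"])  # preprocesses and parses
hits = []
find_uses(tu.cursor, "x", hits)
for f, line, col in hits:
    print(f"use of 'x' at {f}:{line}:{col}")
```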
The problem is that we still need to write custom C++ code to traverse the tree and extract the required information from it. So how about we write that code once: code that walks the AST, extracts the interesting features, and saves them into a JSON file. This is how the function type database is actually created. We have this clang processor that takes one source file, extracts the predefined information from it, and saves the resulting JSON file. You can do this in parallel for as many sources as you want; of course, you need enough RAM to help you with that. All the intermediate JSON files are then combined into one final JSON file. You can do this for any combination of source files, but the most sane way is to do it on a module basis, that is, one JSON file per linked module. The rationale is that if the module was successfully linked, the combination of functions and features in it is probably somehow reasonable.

So what do we extract from the sources? We extract information about functions, information about types, global variables, and the initialization of function pointer members of global structure types. One important thing to note is that only selected information is extracted from the code; we don't save the entire AST, because even with this selective extraction the JSON can get quite big. It's normal that for the vmlinux kernel executable the JSON size is around a few gigabytes.

Okay, so let's take a look in more detail at the data extracted from the source files; here are some examples of source code and its JSON representation. In the case of functions, we extract the name and the arguments of the function, meaning the names and the types of the arguments. We also extract function attributes, like whether this is a variadic function, whether it is an inline function, and what the linkage of the function is, plus the body of the function, both preprocessed and original. We have literals, like string literals or integer literals; we also have taint on arguments, and call information. In the case of call information, we save the list of all functions called inside the body of the function, along with the variables used in the argument expressions of the calls, so with that information we can build the entire call hierarchy for a specific function. We also save all the automatic variables created inside the function body, together with the compound statements where they were created, and we save references to all the global variables, other functions, and types used inside the function. Another thing we extract is expression information: for example, we grab information about the indirection operator, otherwise called the dereference operator, and also array expressions; in those cases we save all the base variables used in the expressions, and also the computed offset into memory. In a similar fashion we save information from member expressions: the base variable of the member expression, and the member offsets along the member expression chain, similar to an offsetof expression. And finally, we save all the variables used in if statements, switch and loop conditions, and return statements.

In the case of types, the stored information allows us to reconstruct the types afterwards, and this is especially true for structure types. What we save is information about the members: the names of the members, the types of the members, the member offsets, and the full size and layout of the structure. So even for very complex nested structures we can fully generate the type definition with all its dependencies just by reading the JSON file. Of course we also store all the built-in types, which is mostly their size; enumerations, with the enum strings and enum values; and in a similar fashion we store array types, typedefs, etc.
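Once the database exists, consuming it needs no compiler at all; plain JSON processing is enough. A minimal sketch, assuming a simplified db.json with a "funcs" list of entries like {"name": ..., "calls": [...]} and a hypothetical root function; the real function type database schema is considerably richer.

```python
#!/usr/bin/env python3
"""Sketch: print a function's call hierarchy straight from the JSON database."""
import json

with open("db.json") as f:
    funcs = {fn["name"]: fn for fn in json.load(f)["funcs"]}

def call_tree(name, depth=0, seen=None):
    # Walk the recorded call information; no parser needed at this point.
    seen = seen or set()
    print("  " * depth + name)
    if name in seen or name not in funcs:
        return  # already visited, or defined outside this database
    seen.add(name)
    for callee in funcs[name]["calls"]:
        call_tree(callee, depth + 1, seen)

call_tree("do_sys_open")  # hypothetical root function
```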
And one interesting case is the initializers for function pointer members of global structure types. For example, we have these file_operations variables spread across the source code of the Linux kernel, and the members of the structure are function pointers to the specific implementation for a given driver; so we have the read, write, ioctl, and mmap handlers, etc. Now, whenever these handlers are initialized statically with a function name, we just grab that name and store this information inside the JSON file. So, for example, we can extract a list of functions that are the actual implementations behind various interfaces, and some of them may actually be entry points to the kernel, which should be reviewed in the first place.
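A minimal sketch of that entry-point extraction, assuming the database stores global initializers as entries like {"type": "file_operations", "inits": {"unlocked_ioctl": "my_ioctl", ...}}; the exact representation in the function type database differs, but the idea is the same.

```python
#!/usr/bin/env python3
"""Sketch: collect driver handler functions registered in file_operations."""
import json

HANDLERS = {"read", "write", "unlocked_ioctl", "compat_ioctl", "mmap"}

with open("db.json") as f:
    globs = json.load(f)["globals"]

entry_points = set()
for g in globs:
    if g.get("type") == "file_operations":
        for member, func in g.get("inits", {}).items():
            if member in HANDLERS:
                entry_points.add(func)  # a driver handler: review these first

for fn in sorted(entry_points):
    print(fn)
```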
I've just shown you some examples of code and its JSON representation, and one important thing to note about all of this is that if some information we want is missing from the JSON, support can easily be added to the clang processor to retrieve that data. The opposite is also true: if there's too much information that we don't use and the JSON gets too big, we can simply remove it. Currently the extracted data is predefined, but I imagine the extraction could be made as customizable as possible, for example by providing some schema file.

So how could we use the JSON data in a productive manner to support vulnerability detection? The first example could be security code review support. Imagine a system that could automatically extract a list of functions which implement entry points to the kernel and present them in an IDE-like manner: we could have the original code, the preprocessed code, diffs with a previous release or a previous model or another OS version, taint information on arguments, etc., generally things that make this critical review easier for the software engineer. You could apply heuristics that look for potentially vulnerable patterns in the code and sort the functions with the probability of error in mind. Some examples of such heuristics could be the usage of dangerous functions in the call hierarchy, or cyclomatic complexity; the rationale is that the more complex the code, the higher the probability of an error in there. Maybe some memory usage patterns: the more the code shuffles data around across different buffers, the more probable that there is an error along the way. It would also be good if you could implement your own recipe for entry point extraction; for example, you have some driver that has many different implementations for different commands of the ioctl, buried deep down in the main ioctl code, so you could extract all the implementations and mark them as entry points for the review. Another nice feature would be to integrate output from other tools into your function view, like info from fuzzers or static analyzers, so you have everything in one place to help with the proper review.

Another potential usage is that we could easily generate dictionary entries for fuzzers like AFL, just by grabbing the strings and other literal values used in the source code; it's all in the JSON file, so it's very easy to do.

And one of the hot topics these days is structure-aware fuzzing. What is this about? When you have some complex structure for which a fuzzer generates data, the fuzzer can do a much better job when it knows the layout and the types of the members of this structure. In the example presented here, we have a structure that contains a string, an enumeration, and some floating point value. We could write a fuzzer harness for it that takes some buffer, fills an instance of the structure, and fuzzes it. We run the fuzzing session and we get the crashing input at a little above 100 million executions for a specific seed. Can we do better than that? Well, we have libprotobuf-mutator, which can take a protobuf description of the data and mutate it accordingly, so if this protobuf describes our structure well, the mutated data that fills the structure will be much better suited. And as it turns out, it reaches the crash much faster: it needs around three million executions to do that. Okay, the mutations are slower, but far fewer of them are needed to reach the crash, so in the end the fuzzing is much faster this way. The problem is that preparing the protobuf descriptions, and writing the function that actually unpacks the data and fills the structure, requires substantial manual work for each structure type, and remember, this is a somewhat contrived, very simple example. With the detailed type information from the JSON file, we could generate the initial protobuf descriptions, and also the code of the unpacking function, in an automatic fashion. This generated code might not be perfect; there can be complications, for example some members of a structure can point to other members of the same structure, so it might still require some revision and fixing by hand, but most of the painstaking job of preparing protobuf descriptions for complex structures can now be automated.
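A minimal sketch of that generation step, assuming a simplified "types" section in db.json with struct entries like {"name": ..., "members": [{"name": ..., "type": ...}]}; the struct name "fuzzed_msg" and the naive C-to-proto type mapping are placeholders, so the output is a starting point for a harness, not a finished one.

```python
#!/usr/bin/env python3
"""Sketch: emit an initial protobuf message from a struct's type entry."""
import json

PROTO_TYPE = {"int": "int32", "unsigned int": "uint32", "long": "int64",
              "float": "float", "double": "double", "char *": "string"}

def to_proto(struct_entry):
    lines = [f'message {struct_entry["name"]} {{']
    for i, m in enumerate(struct_entry["members"], 1):
        ptype = PROTO_TYPE.get(m["type"], "bytes")  # fall back to raw bytes
        lines.append(f'  {ptype} {m["name"]} = {i};')
    lines.append("}")
    return "\n".join(lines)

with open("db.json") as f:
    types = {t["name"]: t for t in json.load(f)["types"]}

print(to_proto(types["fuzzed_msg"]))  # hypothetical struct under fuzzing
```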
As a final example, let's now discuss one of the applications that we developed, which nicely presents the usage of the function type database. The application is called Auto off-Target, AoT for short. Let's have a look at what kind of problem the AoT project solves, so you can better understand how the JSON database helps in this regard.

Imagine you need to test a piece of software that runs on custom hardware which is very difficult to test. A good example would be a mobile phone, which runs modem software that implements connecting to the cellular network, and one of the functions inside the modem is probably parsing the messages incoming from the network. So how would we test the parser inside the modem? Maybe we could create a little mobile network, connect our phone to it, and feed messages from this network to the modem; we would force the modem to run our parser function. To achieve that, we'd need to set up a base station, get software that implements the required network protocols, etc. It seems quite scary, right? And expensive. And even if we succeeded in such a preparation, the fuzzing would probably be very slow, because we need to send the messages over the air, and we have a natural limitation in throughput. And what to do in case of crashes? It's hard to debug on the target; it's not that simple to just take the modem and run it under gdb, and there might not be sanitizers that would print us beautiful messages with the location of a crash. I'm not saying that this is a bad approach; whole-system testing is a perfectly valid approach, I'm just saying it might be very difficult to pull off.

So if you're looking for alternative ideas, one of them would be to do some off-target testing. Maybe you could extract the parser code and compile it on a host machine, the machine where you actually do most of the development, which might be some Linux or maybe another operating system, and then you just have an executable that you can test on your development machine. We could also utilize all the standard tools: gdb, embedding coverage information inside the executable, fuzzers, sanitizers, Valgrind, or even symbolic execution, etc. And the parser is probably written in pure C, so it doesn't depend on the hardware very much; maybe this is a feasible approach.

Okay, so let's assume we want to extract the code for off-target testing. We just copy the function and try to compile it, and it probably fails because there are missing dependencies. So you pull in more code: more types, other functions that were referenced, more global variables, and repeat, and you probably do this many times until it finally compiles. The problem is that for complex systems this process is not scalable. It's pretty common, especially in the Linux kernel, that the dependencies I just mentioned are really, really huge; for example, for some Linux kernel functions we can easily end up pulling one third of the kernel code. And this is where the code information comes in. The AoT project was developed to automate this task of extracting the off-target code; that's why the project is called Auto off-Target. You can visit the GitHub page of the tool for more information, but the main point is that the AoT project reads solely the JSON data from the function type database and operates on it exclusively, as it has all the information you need about types, functions, globals, etc.
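The core of that automation is a dependency-closure computation over the database. A minimal sketch, reusing the simplified schema from the earlier examples (per-function "calls" and "refs" lists) and a hypothetical root function; the real AoT does far more, such as generating stubs and compilable off-target sources.

```python
#!/usr/bin/env python3
"""Sketch: compute the set of functions, globals and types a root function
transitively depends on, the step AoT automates."""
import json

with open("db.json") as f:
    funcs = {fn["name"]: fn for fn in json.load(f)["funcs"]}

def closure(root):
    need_funcs, need_globals, need_types = set(), set(), set()
    work = [root]
    while work:
        name = work.pop()
        if name in need_funcs or name not in funcs:
            continue  # already visited, or external: a stub would be needed
        need_funcs.add(name)
        fn = funcs[name]
        need_globals |= set(fn["refs"]["globals"])
        need_types |= set(fn["refs"]["types"])
        work.extend(fn["calls"])
    return need_funcs, need_globals, need_types

f_, g_, t_ = closure("parse_message")  # hypothetical parser entry point
print(f"functions: {len(f_)}, globals: {len(g_)}, types: {len(t_)}")
```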
Coming up with other examples is up to you now. One thing I'd like to achieve with this talk is simply to show you what we did and maybe get some feedback from the community; I'm pretty sure there are a lot of people much smarter than me watching this talk, so maybe you can come up with some other clever ways to use it. To summarize: everything we've been talking about up to this point is just the software that runs on a mobile phone, but there's much more software out there; we have other operating systems, the car industry, games, etc. For example, in Ubuntu 20.04 there are more than 70,000 packages in the repository, which constitutes almost 600 million lines of C and C++ code. So the tools presented here could be applied to the entire C code base ever created. So far the function type database works only for C code, but the C++ implementation is on its way, at least for the C-like parts of C++.

And finally, here are some key takeaways from this presentation I'd like you to remember. Our software systems are extremely complex these days, and CAS can help you navigate through this complexity. It can improve your tools and allows for more automation, and this applies to all software engineering jobs, especially the security verification of low-level code. It can be applied to the entire software code base ever created. And you are the one that could find new ways of using it. Thank you very much for your attention, and if you have any questions I'd be glad to answer all of them, so just reach out to me in any form of your choosing, and I hope to talk to you again.