So can we start now? Yes. Okay. All right. So let's start. The topic today for me is something a bit less traveled, I guess, which is debugging a build. I don't know if some of you ever wished you could do something like GDB on make. It probably doesn't work. And even if it worked just for make, builds are usually a bit more complicated than just running make.

A quick thing about me: I'm on a mission to make it easy to reuse free, libre and open source software. I maintain or contribute to many projects. I've helped a bit at some point on strace, and contributed a bit to the Linux kernel, mostly to bug people about licensing issues. I'm a co-founder of SPDX. I used to be a committer on Eclipse and JBoss in the past. Most of what I code is in Python and C++, and I try to stay away from JavaScript, but I'm forced into it now and then.

So why should you care about understanding your build? There are several reasons. First and foremost, you want to have a fine-grained understanding of what gets into your binary. And you would say, well, of course I know, that's my code, I know exactly what gets built. And that holds up pretty much until you have more than one C and header file. Now think about something like building a full Android device. That means a whole stack from the kernel up to the apps. This build typically involves multiple language compilers, some pre-built binaries, scripts. Roughly 400,000 steps are executed when you build an Android device. That gives you a bit of an idea of the complexity of the problem. And at that scale, this is no longer fully deterministic.

Now, even if you don't do systems programming but application-level programming: you create a Node package, talking about JavaScript, and you say npm install. And, well, guess what? You now have 200 or 300 dependencies that were installed from left and right. At some level, large builds are a bit like magic. They work, eventually they do, most of the time they do, but it's really hard to understand exactly what happens, especially in the case of npm or Maven or other application- and language-specific package managers: things are fetched and provisioned at build time, so they come from the network. I remember, a long time ago, a Java package called Rhino, the JavaScript interpreter written in Java. Part of its build script was actually doing a wget to fetch some code, integrating it in the build, deleting the source, and basically you ended up with some binaries that you had no idea where they were coming from. They were just popping up out of thin air, unless you could trace exactly what was happening.

There are other applications. Eventually we are all redistributing vulnerable code, integrating vulnerable packages. There may be licensing issues, where things like linking dynamically versus statically have their own importance when it comes to the LGPL and GPL. These are the kinds of reasons why you may want to care.

So now let me state the problem: given some binaries or packages, I want to know where they're coming from, and eventually which known open source packages they're built from. I also want to trace exactly what the complete corresponding source code is, which is more of a legalistic term that applies to the GPL. And I want to do that either in the large, for large complex builds like a whole device, or in the small, where I'm targeting a single binary and need to understand exactly what gets into it.
Does that make sense so far? Is this a problem you have had at times? More or less? Okay. At some level it's a pretty narrow use case. Most of the time we hope the build runs, we don't have problems with the build, and we don't want to touch it.

Okay, so now the techniques to get there. There are many ways, but the only technique I will talk about today is what I call dynamic forward build tracing. That means you have to run the build, and you execute it with some tracing magic (we'll come to that in a second), such that from this trace you can eventually create a graph of your processes, executables and file transformations, at full depth and full complexity, that you can then query to understand: this source, which binary is it used in? Or: this binary, which sources were used to actually make it?

There are many other techniques, and they're off topic, but let me cite them just so we can rule them out. Anything that would be instrumenting the build tools or the compilers: off limits. The technique I'm talking about here requires no change whatsoever to your build environment. Actually, it doesn't care what build environment you use: make, CMake, building a Node package, running some script, it doesn't matter. Anything that's static analysis, where you start from the binary, for instance collecting DWARF debug symbols (a technique used by GDB) to point back to sources: out of scope. Anything tracing the runtime execution itself is also out: we're strictly talking about the build, though eventually the same tool could have applications for dynamic tracing of the runtime execution of code; that's not its primary purpose. Other things, especially disassembly, emulation, reusing compiler conventions (for instance, a .class file in Java typically comes from the corresponding .java file with the same path or sub-path): none of that. Symbols and debug symbols are also out of scope; the technique here doesn't care about symbols and debug symbols at all. Now, these techniques are interesting, and the last few are something I'm working on. I said in my abstract that I would present that at the end, and I'll come back to it in a second.

So, the problem: static analysis is not an exact science, it's difficult. Instrumentation, for instance instrumenting GCC or make (there have been several attempts, and there are new attempts in that space), is complicated because you eventually have to instrument each and every build tool, and it's very dependent on the internals of the code most of the time. So both are complicated. With static analysis, in the best case you would have all the symbols in the binaries and could trace them unambiguously to your sources, but that's not the case at scale in practice. Also, builds, when they're large enough, are complicated. There are few people on a team who really understand them, and few people able to make sense of them well enough to evolve them efficiently. So just saying "oh, never mind, just make sure you compile everything with GCC and -g" is usually not trivial. You turn it on in one place, but it happens that this other executable is built in another way and won't be built in debug mode. So in practice I found it extremely difficult to obtain proper debug artifacts such that static analysis alone could trace things back.
The other thing is that it's pretty hard to conclude that something is not built. Say you have a bunch of code checked out; let's assume you have everything and you're not fetching anything remotely over the network. How do you know which part of the code is not built? It's not completely evident. The other important thing to keep in mind is that building is not the same as deploying. For instance, part of the build may be executing tests, and these tests may not be part of your actual deployed runtime binaries. The same applies to tools: in some cases bits of the toolchain end up in your deployment and some others don't. Any questions? Is the context clear for everyone so far? Okay.

So the ideal solution should be very easy on us, requiring absolutely no change to the build and its configuration. It should be 100% accurate and allow me to really understand everything about the build. And what's the approach? Syscalls. I'm assuming a lot of you understand what a syscall is, but for those who don't, a simple way to think about system calls is that they're the machine language of the kernel: the Linux kernel in particular, kernels in general. The kernel typically doesn't know much about what's happening above it. It knows about the file system, network, CPU, scheduling, and the simple things it's asked to do: open a file, close a file, read some bytes, write some bytes, spawn a process, that kind of thing. It's very low level, and that's why it's a good idea to think of it as the kernel's machine language for user space. The other thing about system calls, at least in the context of Linux, is that everything you do in user space ends up as a system call in the kernel. It's 100% accurate. You know everything that happens, but you're looking from below. The only things you see are open file, read bytes, write bytes, close file, these kinds of operations. Very low level, but you see everything.

So the tool is called TraceCode, and the approach is fairly simple. You run your build under strace, a tool whose maintainer, Dmitry, is in this room (I'm not worthy). strace is a system call tracer for Linux. The way you run it is that you prefix the command you want to execute with strace, possibly with some extra flags, and it will collect a trace of everything that happens in the kernel. So you basically get a disassembly of the machine language of the kernel for whatever you started in user space. In my case, and it's probably not the best choice, I'm tracing every system call. Which is huge. Think about the full Android build, 400,000 operations: you're talking about a trace in the range of 20 gigabytes. So typically bigger than the executables you're building, and probably bigger than all the artifacts and intermediate temp files the build created. That could probably be optimized, but it's actually simpler to just trace everything.
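To make that concrete, here is a minimal sketch of collecting such a trace. This is an illustration, not TraceCode's actual wrapper, and the flag set is just one reasonable choice (the flags shown are real strace options):

```python
import subprocess

def trace_build(build_cmd, trace_file="build.trace"):
    """Run a build command under strace, logging every syscall.

    -f      follow forked/cloned children (the whole build process tree)
    -ttt    print absolute microsecond timestamps, to order events later
    -y      decode file descriptor arguments into the paths they point to
    -s 256  print longer strings, so long file paths aren't truncated
    -o      write the trace to a file instead of stderr
    """
    cmd = ["strace", "-f", "-ttt", "-y", "-s", "256", "-o", trace_file]
    subprocess.run(cmd + list(build_cmd), check=True)
    return trace_file

# For example: trace_build(["make", "-j4"])
```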
Once you have this trace, we process it and rebuild a directed graph of the file transformations that happen through a given tool executable. Again, at the system call level: read, write, open, close, sockets, file descriptors. These happen in the context of a process, and the process in the context of an executable. You're basically saying: okay, I have a process, some file operations take place there, so I collect them. And you trace the life cycle of the files across the processes and executables.

Once you have that in a graph model, you can query it whichever way you want. You say: okay, I do a topological sort, and from that source file, tell me what the last file in my graph is that's never read again, that's only written to. And what happens automatically, in the simplest cases, is that the thing at the end of the graph that's only written to and never read is actually your binary. In the other direction, if you take some binary and say, give me the files at the other end of the graph, going in reverse, that are only read and never written, those in most cases are source files. (I'll show a toy sketch of this kind of query in a moment.)

The cool thing about this is that it's completely agnostic with regard to the compiler, the make tools, the build chain you use, the programming language. It doesn't care. And it doesn't require any change to your build process. That's really what makes it useful in a fairly large array of use cases. The only thing you need is to run your build under strace, and to run it on Linux. There are ways to collect the same kind of system trace on Mac with dtrace, and with some magical incantation on Windows, but that's none of my interest; I just know it's possible. I've even seen an implementation from some testing tool using Chromium that takes a similar approach: they're trying to isolate the runs of tests, so they know exactly which files are used to run a single unit test and can parallelize them more efficiently. They use a similar technique to collect, during a test run, which files are being touched, and they've implemented a dtrace-based tracer and a Windows-based tracer too. And the cool thing is that this really provides 20/20 vision into the build process, at least into some aspects of it: what tools were used, what executables were spawned, what files were read and written, and in which order.

So let's look a bit at how it works. Since it's a graph, you can pass it to dot and Graphviz and actually draw cool things with it. I'm going to try something very simple first, a tool called patchelf. It's a single C++ file, a mini-tool created for a distro called NixOS, which just fiddles with ELF format internals to do some relocation and things like that. And this is an example of a build graph for patchelf. So let's walk through it; we'll work through more complex ones afterwards. At the top I see collect2, some executable from GCC 4.6, with some PID at some timestamp; that's what this box says. It actually reads two temp files. I don't know why, but it does. I don't know who created these temp files, and it's a dead end in the graph; it doesn't go anywhere else, so I can ignore it. More interestingly, I can see my Makefile being read by make. That makes sense, right? That's the kind of thing we would expect, of course. But at this stage (maybe it's a filtered graph in this case, I don't remember exactly) this doesn't go anywhere else either. We also see that make was running in shell mode, so under the control of bash. This is another small dead branch of the graph. Here we see something else interesting: a .deps directory with .Po files which are read, together with a Makefile, again by make. There's some invocation of g++, no idea what it's for, maybe some configuration step. This graph has been filtered a bit, and we'll come back to the filtering afterwards.
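To make the "only written, never read" and "only read, never written" queries above concrete, here is a toy sketch using networkx. The graph is hand-written and the tool/process nodes are elided, so this only illustrates the idea, not TraceCode's real model:

```python
import networkx as nx

# Toy file-transformation graph: an edge A -> B means some process
# read A and wrote B (the intermediate tool nodes are elided).
g = nx.DiGraph()
g.add_edges_from([
    ("patchelf.cc", "patchelf.s"),  # cc1plus: C++ source -> assembly
    ("elf.h",       "patchelf.s"),
    ("patchelf.s",  "patchelf.o"),  # as: assembly -> object file
    ("patchelf.o",  "patchelf"),    # ld: object file -> executable
])

# "Binaries": sinks, files that are written but never read again.
binaries = [n for n in g if g.out_degree(n) == 0]

# "Sources" of a binary: walk the graph backwards and keep the roots,
# files that are only read and never written.
def sources_of(binary):
    return sorted(n for n in nx.ancestors(g, binary) if g.in_degree(n) == 0)

print(binaries)                 # ['patchelf']
print(sources_of("patchelf"))   # ['elf.h', 'patchelf.cc']
```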
Now the interesting stuff. Actually, let me use that pointer after all. First, we're reading a .h file here. Then I'm reading my patchelf.cc, and a few standard includes provided by the toolchain, and eventually this is all processed by a tool called cc1plus. Frankly, until I used TraceCode I had no idea that GCC was actually not some kind of monolithic thing but many tools. cc1plus is the C++ compiler proper, which transforms the C++ code into assembly. As a side effect it also creates some patchelf .tpo file, some internal file, whatever it is. On the bottom branch here, it invokes the GNU assembler, which eventually produces a .o object file, and finally the linker, ld, is invoked to produce my patchelf executable. It's actually pretty surprising, when you see it, how complex it is and how many steps things go through for a single .cc file with one include. There's also stuff going on at the top, eventually dead branches ending in make, processing some dependency files. So there are a lot of these .Po, temp and dependency files created in various places during a typical GNU toolchain build.

Now, as I said, if I ask: patchelf, tell me what the sources are? Well, you query the graph backwards, and the sources are these: the things at the left end of a topological sort, which have only been read and never written to, just like in the little sketch above.

Now let's do something a bit more complex. Someone asks: "Did you filter out all the system libraries which get linked, or is it really an example with no .o files from the compiler, no libc, no nothing?" I've probably filtered some of it with some of the options, yes. Another remark: the executable appears to depend on the Makefile, and there must be a better way to present that. That usually depends. There's a bunch of options to do filtering, and one of the things you often don't care about is the Makefiles, so in many cases you'd filter them out of your graph. If you're concerned only about the source files, you won't keep that.

So, this one is actually a build of bash. That starts to be a bit more complicated, to the point where it exceeds the capacity of some of my Linux distro's PDF viewers. "How do you generate this graph?" There's a graph subcommand: once you've done your processing, you basically say, here are my traces, build me a graph. "Do you have one kind of edge in the graph or two different kinds: edges where a process takes one file and produces a different file, but also edges where a process creates child processes? And is that why you have these isolated clusters?" So, the thing is that we trace processes and their subprocesses and build a process tree, and the process tree eventually gets squashed if no interesting operation takes place in an ancestor. That's why, if I look at the graph of patchelf, we conflate, for instance, bash and mv into one box: mv is actually the child process, and it gets conflated with the executable of the parent process, which didn't do anything interesting, it just spawned the subprocess.
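A minimal sketch of that squashing step, assuming a process tree where each node carries its own file operations; Proc and squash are made-up names here, not TraceCode's API:

```python
from dataclasses import dataclass, field

@dataclass
class Proc:
    pid: int
    exe: str
    ops: list = field(default_factory=list)      # file reads/writes
    children: list = field(default_factory=list)

def squash(p: Proc) -> list:
    """Drop ancestors that did no file work of their own.

    A parent that only spawned children (no reads/writes) is removed
    and its children take its place, which is how a bash that merely
    exec'd mv ends up conflated into a single node in the graph."""
    kept = []
    for child in p.children:
        kept.extend(squash(child))
    if p.ops:            # this process did real work: keep it
        p.children = kept
        return [p]
    return kept          # boring wrapper: splice its children upward
```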
"How would pipes be shown in your graph? I mean, one program passing data to another program, but never going through files." Well, that's interesting, because I've stumbled on builds like that. Early on I saw that GCC, for instance, was always writing .s files between cc1plus and the assembler. But that's not always the case: I found builds where effectively no .s files were created at all, they were just piped. Whatever magic configuration was used, I don't know, but it's possible: you have one process that takes in the file and outputs to the pipe, and the next one reads from it. Well, you still have file descriptor operations taking place in this case. Even when you're just piping, you still have file descriptors, and actually we're not really tracing files, we're tracing file descriptors. So there are probably edge cases where this doesn't hold 100%. In practice, and this is something I'll come back to, it's at times a bit hard to reverse engineer what happens at the file descriptor level and make sense of what it means from a user space perspective. There's not always a 100% one-to-one matching.

Another interesting case: say you have a compiler that you can pipe files to, but these files are completely unrelated; each is built into something completely unrelated to the previous one. That's a use case that happens too. The trick here, which is not 100% implemented, is to track the lifecycle of the file descriptors and to demultiplex when a given executable looks like it's processing several files that are related, versus processing several files into several outputs that are completely unrelated. There are quirks, of course; it's not perfect in all respects, but there are ways to handle this.

So, bash: you see on the left, that starts to be a bit messy. An interesting thing, by the way: if a configure script is executed at some point in time, you'll see your config.h in the graph, and you'll see how, where and when it's used. If we look here... actually, I think Okular is able to zoom better on these large graphs. Yes, much better. So, this is another build, CUPS, a printing tool for Linux. It will render, eventually. You see how the files on the left, a bunch of .c and .h, end up being multiplexed through multiple processes. We have a bunch of .c files; we have, again, assembly files being created and the assembler invoked over and over. Eventually, if we go far, far to the right, we have a bunch of .o files and our magic invocations of the linker, ld, here. And in this case I get something which is probably an executable for HPGL printers.

As a side note, by the way: in most standard GNU make builds there's something completely crazy going on, which is that the existence of RCS files and a lot of other old version-control artifacts gets checked over and over again. We're talking eventually hundreds of thousands, millions of checks on a large build, which could represent roughly 10 to 20% of the time spent building, just checking for the existence of non-present files. I'm sure there are folks who understand make much better and know the simple flags to pass to avoid that kind of behavior, but I've seen it in practice on large builds, and it's surprising how some baggage gets carried over and just impacts every build by default unless you know about it.
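On that side note: the probing comes from GNU make's built-in implicit rules (the RCS/SCCS patterns), and running make with -r disables the built-in rules. If you wanted to see the effect in your own raw strace log, a hedged sketch like this, with a made-up function name and a simplified regex, could count the failed existence checks:

```python
import re
from collections import Counter

# In a raw strace log, the probing shows up as failed lookups, e.g.:
#   4711 stat("foo.c,v", 0x7ffd...) = -1 ENOENT (No such file or directory)
ENOENT = re.compile(r'\b(?:stat|lstat|access|open|openat)\("([^"]+)".*ENOENT')

def missing_file_probes(trace_path, top=20):
    """Count the nonexistent paths a traced build probed most often."""
    hits = Counter()
    with open(trace_path) as f:
        for line in f:
            m = ENOENT.search(line)
            if m:
                hits[m.group(1)] += 1
    return hits.most_common(top)
```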
I have a shared object created here, and some weird loops that can happen; I have a bunch of .a static library archives being created, more executables, so that starts to be a bit more involved. At some scale, rendering the graph eventually stops making sense unless you have a very large machine with many CPUs, a very large printer and a big screen, but it's very handy for reasonably sized, smaller builds. Actually, funnily enough, you can even trace the build of strace with strace, which gives it a nice recursive touch. So let's see... it needs Okular too. If we zoom here, we should see a single strace executable being created. Yeah, it's there on the left, and it's nicely multiplexed into a big ld invocation with a lot of .o files. So it's a pretty straightforward build, not much to say there. We have some assembly files being created, which is interesting... well, there's always some assembly file created in a GCC build at some step, either from the preprocessing or from the C compilation itself; no C++ here, because there's no C++ in strace, and probably won't be for a long while.

All right, let's go back to where we were. I was showing some of the output. So it's a command line tool, pretty straightforward, and it comes with some help. It's written primarily in Python. A lot of the heavy lifting is about parsing the trace. There's been some effort in strace to make the traces easier to parse, and eventually that part of the tool, all the trace parsing, could go away in the future, but it still requires a bit of work there. There are a lot of options and help. The general steps: you run your trace, then you want to apply some filtering; for instance, you may not be interested in anything that comes from the toolchain or from system includes. Then you can do inventories of all the reads and writes, or graphing, or queries of the relationships from source to binary or from any binary to any source.

The internal model: at the bottom, there are a bunch of processes, related together by PIDs. They hold a list of reads and writes, a list of executables that were spawned or forked, and their children. Reads and writes are grouped into operations, where you have a process that reads some source (well, we don't know if it's our source code, but some source file) and writes to some target at some time. There are some specific atomic read-writes, like a rename, where a single system call reads and writes at the same time. And then there's the notion of an executable. I've described the hierarchy bottom-up, but it's a hierarchy, generally speaking, and there are relations between each of these.
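To make that model concrete, here is a minimal sketch; the class and field names are made up for illustration, not TraceCode's actual code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Operation:
    """One file transformation: a process reads some sources and
    writes some targets at a point in time."""
    sources: List[str]
    targets: List[str]
    timestamp: float
    atomic: bool = False   # e.g. rename(): one syscall, read and write

@dataclass
class Process:
    pid: int
    executable: str
    operations: List[Operation] = field(default_factory=list)
    children: List["Process"] = field(default_factory=list)

# A rename is modeled as a single atomic read-write operation:
mv = Operation(sources=["patchelf.tmp"], targets=["patchelf"],
               timestamp=1518429371.25, atomic=True)
```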
Now the complications. Parsing the trace, I already alluded to. The other thing that's difficult is tracing the life of a file descriptor. File descriptors have a number, and they get reused; they resolve to a path, and you want to trace the absolute path, but the numbers get reused over time. Right now, in the situation I alluded to earlier, a long-running executable processing a lot of unrelated files, the demultiplexing of these operations is not working great. What I'm thinking about doing is integrating a timeline: tracing, for each file descriptor, its lifecycle in nanoseconds, from when it's opened to when it's closed, so that whenever a file descriptor shows up again, with the same path or the same number, it's considered a new one if the previous one was closed before. (There's a minimal sketch of this idea a bit further down.)

The other thing is all the junk, the non-interesting stuff. Actually, I lied when I told you it was really completely agnostic and that I didn't need to know anything about the build system. That's not entirely true in practice. If you have a large graph, you may have a lot of junk, and you want to do some filtering, and this filtering requires understanding what's taking place. Say you don't care about all the .Po, .tpo and other temp files created by make: you may want to filter those out. So the facility provided is to collect inventories of these reads and writes, group them, and then apply filtering through command line invocations, so you can prune them from your graph. These tend to be fairly repetitive: say you want to ignore anything in /tmp. What I'm seeking there, to make it more efficient, is to build profiles for typical build environments, with a set of regex patterns to systematically include or exclude certain files, as an option, so that the simple things are simple to do. It's a directed graph, so you can query from both sides. And, excuse me, yes: you filter at the process level, so you can say, I want to filter out a write that matches this pattern, or a read that matches this pattern, or I want to always include a read or write that matches this pattern. You can put the patterns in a list or just enter a long command line.

One of the original use cases when I built the tool: I was doing an analysis of the code base of a large git repository hoster that I will not name, and there was one problematic executable, native code, where we didn't know exactly what source code had been used to build it. So one application is to understand, for instance, static versus dynamic linking with other code, especially copyleft code. Another thing, and it's just one of my side projects: if you know exactly what source code you have and are distributing, do you know whether any of the software packages used are known to be vulnerable? That can be an application. And once you finally know what goes in, what's the resulting license? If you're combining GPL and non-GPL code in one executable, you can eventually conclude, reasonably safely, that your resulting binary should be made available under a GPL or GPL-compatible license. This information on vulnerabilities and licensing is not always easy to access for every software package, and I'm starting a small project on the side with others to help FOSS project maintainers provide more clarity in that domain.
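A minimal sketch of that planned file descriptor timeline, assuming timestamped open/close events have already been parsed out of the trace; identifying a descriptor by (pid, fd, open time) rather than by its number alone is the whole trick:

```python
live = {}        # (pid, fd) -> (path, opened_at)
lifecycles = []  # completed lives: (pid, fd, path, opened_at, closed_at)

def on_open(pid, fd, path, ts):
    live[(pid, fd)] = (path, ts)

def on_close(pid, fd, ts):
    entry = live.pop((pid, fd), None)
    if entry:
        path, opened_at = entry
        lifecycles.append((pid, fd, path, opened_at, ts))

# fd 3 gets reused: same number, but two distinct lifecycles,
# so the two files are never confused with each other.
on_open(42, 3, "/tmp/ccabc123.s", 1518429371.100)
on_close(42, 3, 1518429371.250)
on_open(42, 3, "/src/patchelf.cc", 1518429371.300)
```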
So, the tool: it's on GitHub, written in Python, Apache-licensed. And as I said, the next steps are a few things: the file descriptor timeline; a separate tool that would work on static analysis and static reversing, the same approach using symbols and debug symbols, which is both more complicated and simpler (in the end it should feed into the same kind of graph, which you should be able to query the same way); or a fuzzier approach using signatures on the binary side. And again, I lied: I said I would also present an extension that does the same thing on the static side, but I haven't finished it. I've started to work on the ELF, Mach-O and Windows PE parsers to extract symbols and debug symbols, but it's still work in progress at this stage.

So, credits. strace really works: if you've not used it, you should. Use it now, use it every day, and you'll have a happier life because of it. And you can thank Dmitry, who is in the room here, and the many other contributors. It's actually one of the longest-running, still very active open source projects; it predates Linux by a good margin in terms of ancestry. And I didn't invent anything here: I tried to implement the model from a paper by folks much smarter than me, a little-known paper that I encourage you to read, which unfortunately didn't come with an open source implementation. These are people like Sander van der Burg and Eelco Dolstra, who are behind the NixOS distro, so they know a thing or two about builds in general. And that's it, thank you very much.

So I think we have one minute for questions... oh, we have ten minutes, that's great. Yes?

The question is about the overhead of tracing. From experience, it's about 20%, so it's very acceptable. Some of the largest runs I made were full-stack builds of an Android device, using a beefy desktop with two CPUs, 32 threads, building with -j32, and the build was taking about an hour as opposed to maybe 45 minutes. So it's really acceptable in practice, because you don't always need to dive to that level of detail; hopefully the times when you want to do that kind of deep diving are rare enough that it's not a deterrent. Now, do you want to run all your builds under strace? Probably not.

So, the question is: how big are the trace files? Big. Fucking big. The point is that strace is trying to capture, in something that makes sense for humans, everything that happens at the kernel level, and if you trace everything... you know, there's a lot of weird stuff happening in a kernel, like calling Intel home to make sure everything is fine with Meltdown and Spectre and whatnot. So, the question: is it practical to run all your builds under this?
I guess yes, on an ad hoc basis, not on a regular basis.

Another related question: would there be any application for reproducible builds? That's an interesting topic. It has some, because it helps you get a finer understanding of what's happening. But the difficulty is that, because I lied, it's not 100% build-agnostic: you need to understand a bit about what's cooking in terms of the lower-level interactions to make full use of it. So, yes and no.

Yes? Can you speak louder, please? So, as part of the filtering you can also filter executables out of your process graph, and when you remove a node this way, what happens is that I rebuild the links between the nodes on both sides. So in this case, if you were to prune out the ld executable, it's as if ld didn't exist and you go straight from cc1 to an executable, which can eventually be a bit weird, but that could happen. By default you would see the link process; it's always there, and you will see as many invocations of the linker as there are calls. So if you're building a lot of shared objects and executables or kernel modules, in the end you will have a lot of ld invocations. Usually ld is not a big problem in the graph; the problem is more all the intermediate steps of compilation.

Yes? So, the question is whether it could be used to figure out if a build is non-deterministic, like when it calls the system clock. In its current form, no, because I don't care about anything that doesn't touch a file: the code parses the trace and just ignores things that don't touch files or sockets. But you could take the same code and enhance it a bit to trace that kind of thing.

All right, thank you very much, and I'll be outside if you want to talk.