So, hi. Welcome to my talk on mining for bugs with graph database queries. It's been a while since I last talked at CCC. I actually planned on submitting every year, but when I finally submitted this time, it turned out it's actually been five years already. Most of the people in the room probably did not see the last talk, but to make it short, it was mostly about hard-earned bugs, meaning staring at code for a very, very long time, trying to find bugs and exploiting those bugs. And since then, my interest has been in making this process just a little less painful. What I'm presenting today is the result of that. So, I want to give you the big picture first of what I actually want to achieve in the long run. What I'm trying to do here is combine my two main interests. On the one side, the kinds of things that we see at conferences like CCC a lot, which is bug hunting and exploiting bugs, things that are nicely represented by the books that you see there at the top: The Shellcoder's Handbook, The Bug Hunter's Diary, The Art of Software Security Assessment, and in the back, you see Phrack shining over all of this, right? And then it becomes a little more academic with Principles of Program Analysis. And what the kinds of methods proposed in Principles of Program Analysis have in common is that they are very, very exact. Now, my other interest is in pattern recognition and machine learning, and those methods are actually very inexact. They approach problems more from an engineering kind of perspective. And the idea now is: can we take the tools that we have down there, the pattern recognition tools, and build tools to recognize bugs? You know, looking at bugs, looking at code similar to the way a person auditing code would look at it, and then do an inexact analysis, but one which scales very well.
Because the problem that you have with these exact methods is that scaling them to really large code bases, like the Linux kernel, for example, is really, really hard. Now, doing this, I want to have tools which are realistic, you know, not those static analysis tools that give you thousands of hits and can't really be tuned. Instead, I want something to assist auditors in their daily work. Now, zooming in a little, what you're going to be seeing today is that I'm looking at two very different topics and showing that they actually have something in common, that they fit nicely together. One of them is good old computer science compiler construction, and the other is the shiny and new graph databases, or as some people like to say, big data, right? And we're going to see that they actually fit nicely together. That's what I want to achieve in this talk. Now, to start off with, let's take a look at an example bug. This is a bug found by Stefan Esser and presented in his SyScan 2013 talk. In his talk, he presented a lot of different bugs, and this one is certainly not the main one; this is more of a side finding, I would say. But it's interesting for us, as you'll see in a moment. The one thing that's already interesting is that you see bugs like this all over the place. It's not a very unique bug; it's something that you will see again and again and again. Now, as you see, there's this 32-bit-wide variable called namelen, and it's produced by taking this pointer data and then converting from network byte order to host byte order. So that kind of tells you this data is probably attacker-controlled, because it comes from a network, and you would not be converting from network to host byte order if it didn't. Now, what happens next is that there's an allocation.
So this buffer called exit-signal is being allocated, and they use this variable namelen that has just been initialized by the attacker, and they add one inside the allocation. And clearly the problem is: if namelen is the maximum size of an unsigned integer, then this will overflow, and something close to zero bytes will actually be allocated. And finally, there's a memcpy operation, and we copy namelen bytes, so about four gigabytes, into this buffer which is close to zero bytes. So that's a typical heap-based buffer overflow. Now, what's interesting is how he found this, and that's why I took this example. In academic research, there are a lot of different methods being proposed for how you could find this kind of stuff, and they all sound very advanced. You hear about whitebox fuzzers enhanced with symbolic execution, machine-learning-powered anomaly detectors, maybe theorem proving or model checking; but in practice, he used a regular expression for grep. That was pretty amusing. With all the advanced methods that he could have used, he chose to do this. Now, I think this tells us a lot. First of all, it tells us that if you use it right, even a primitive tool like grep is very, very powerful. You just need to make sure that the person auditing the code can actually introduce their knowledge into the process. The second thing that becomes clear is that these kinds of queries, things like this grep regular expression, actually encode knowledge. So I'm going to go back one slide. Yeah, if you look at this regular expression, you see what it tries to match. It tries to look for an allocation, and inside there, some sort of arithmetic operation, so plus, minus, something like that. And in a way, you could say this is a model for different kinds of bugs, where he's saying these kinds of things happen all over the place, so let's just search for them.
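A minimal sketch of the arithmetic behind this bug class, in Python, with the 32-bit wraparound made explicit. This is not the real libssh code; the function name and the mask are just illustrative:

```python
# Toy model of the namelen + 1 overflow: when the addition is computed
# in a 32-bit size_t, UINT32_MAX + 1 wraps to 0, so malloc(namelen + 1)
# allocates (close to) nothing while the later memcpy copies namelen bytes.

UINT32_MAX = 0xFFFFFFFF

def alloc_size_32(namelen: int) -> int:
    """Size actually passed to malloc when namelen + 1 is computed in 32 bits."""
    return (namelen + 1) & UINT32_MAX

assert alloc_size_32(100) == 101          # benign case: 101 bytes allocated
assert alloc_size_32(UINT32_MAX) == 0     # attacker maximum wraps to a 0-byte buffer
```

The mismatch between the allocated size (0) and the copied size (about four gigabytes) is exactly the heap overflow described above.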
And then finally, it kind of shows that false positive rates, for bug hunting, are not so much of a problem. If you have like 50 hits and among them there are 7 0-days, well, then that's fine. You read a bit of extra code, but who cares? So that gave me the idea: well, maybe we can build a search engine for source code that can be used to find bugs. And that's what this talk is about. Overall, it's supposed to look like this. You take source code, you stuff it into some robust parser for the language, and then you import it into a database. And then the analyst, the auditor, sits at one side and is able to query the database, see what comes back, and then eventually adapt the query to actually find what he's interested in. Now, in the back, you can also use this database for different kinds of pattern recognition tools, but this is something that we're not going to be discussing today. Now, the prime question is: if we want to build a good search engine, what does our language need to be able to model? Overall, I think it's the following. We need to be able to ask: what does the statement look like? Can we get from one statement to another? And how do the statements affect each other? And we can, of course, take a look at how people who design compilers have tackled all of these problems, since when you try to write an optimizer or something like that, you need all this kind of information. In compiler construction, they essentially have different graph representations of code, and each of these highlights some aspect. For example, there's the abstract syntax tree that you see here at the top, the AST. This gives you a hierarchical decomposition of the code into all of its language elements. Then there's the control flow graph. You probably know this; that's essentially what you see in tools like IDA Pro, where the statements have been collapsed, right?
And you see how you can get from one statement to the other, and on condition nodes, you see whether the condition needs to be true or false to get to another statement. And then this thing here, which is not so well known, is called the PDG, the program dependence graph, and this has edges to model data flow. So an edge says that a variable produced at one statement reaches another statement without being changed on the way; there's actually data flow from here to there. Now, as I said, they all highlight some aspect of the code that we want to be able to model in a search query, but none of them can really do it all, right? And another problem is, if you take a look at a typical query, then it's going to sound a bit like this: find a call to foo in a statement where data flow exists to a statement that contains a call to bar. That means what you're actually doing here is transitioning from one representation to the other, right? First you're in the syntax world: find a call to foo, which means in the syntax tree there needs to be a node which is a call to foo. But then you want to take a look at data flow: can I get from this statement to another statement? And here the abstract syntax tree fails completely, and instead you want to use something like the PDG. So you want to transition from the AST to the PDG and then again to the AST. And what we want to have is a representation that can do all of these things. The primary observation here is that in all of these representations, the CFG, the PDG and the AST, for each of the statements there's actually one designated node, right? So if you look at this statement, index equals source, there's one node in the CFG, there's another in the PDG, and also one in the AST; here it's just a declaration node. Of course, there are some nodes beneath that, but that doesn't matter. So why not try to join these data structures at statement nodes?
And that's what we did. This is what we call the code property graph. What you see here is, well, parts of the AST are still there, right? Those little trees. But you can also get from one of these statement subtrees of the AST to another statement via data flow links or control flow links, right? And now the idea is: maybe we can describe vulnerabilities as walks in this graph. Now, once we had that data structure, the question was: how are we going to store it? And it actually took me quite some time to realize that if you have a graph, then trying to map it to a classical relational database management system is not going to be much fun, because you need to map a graph to tables. And it took me even longer because there are actually a lot of great reverse engineering tools which do exactly this; so for me, it seemed like the right way to do it, right? And then I started to look at different kinds of databases. I looked at document databases, and mapping graphs to documents also kind of fails. And then I realized: mapping graphs to graphs succeeds immediately. Now, why are relational database management systems not the right choice here? Well, as I already said, the underlying data structure being used there is the table. And the most obvious problem is that if you want to create relationships between nodes, as you have in a graph, just arrows from one node to another, edges, then what you typically do, pretty much all you can do, is create a so-called join table. And in this join table, you associate items from a table on the left with items from a table on the right. And the lookup time will scale with the number of relationships. But what you really want is to be at one node and have immediate access only to the relationships that are outgoing there. And that's what a graph database gives you.
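To make the "join at statement nodes" idea concrete, here is a miniature code property graph in plain Python for two statements, roughly `buf = malloc(len + 1); memcpy(buf, src, len);`. The node ids, property names, and edge labels are illustrative, not Joern's actual schema:

```python
# Each statement has one designated node that roots its AST subtree and
# also participates in control-flow (FLOWS_TO) and data-flow (REACHES) edges.

nodes = {
    1: {"type": "Statement", "code": "buf = malloc(len + 1)"},
    2: {"type": "Call", "name": "malloc"},        # AST child of node 1
    3: {"type": "Statement", "code": "memcpy(buf, src, len)"},
    4: {"type": "Call", "name": "memcpy"},        # AST child of node 3
}
edges = [
    (1, "IS_AST_PARENT", 2),
    (3, "IS_AST_PARENT", 4),
    (1, "FLOWS_TO", 3),      # control flow between the statement nodes
    (1, "REACHES", 3),       # data flow: 'buf' defined at 1 is used at 3
]

def out(node, label):
    """Follow outgoing edges with a given label."""
    return [d for (s, l, d) in edges if s == node and l == label]

# From the malloc statement, follow data flow, then look into the AST there:
reached = out(1, "REACHES")
assert reached == [3]
assert nodes[out(3, "IS_AST_PARENT")[0]]["name"] == "memcpy"
```

The point is that a single walk can mix syntax edges and flow edges, which is exactly what the "call to foo with data flow to a call to bar" style of query needs.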
Now, I know that when you hear big data and cloud era and NoSQL, you're probably thinking: no, this can't possibly make sense, because the only people who use these kinds of words are the guys in suits. But this actually makes sense from a technical point of view, and that's what I hope to show in the next couple of slides. So the native storage format for graph databases is the so-called property graph. That's different from tables. A property graph is just a graph, really, but there are properties attached to the nodes, which just means you can think of it as having Python dictionaries attached to each node, or Perl hashes, or just key-value pairs, really. And the edges, and this is important for us as well, are labeled. So it's not just that there's an edge from A to B, but there's an edge labeled knows or created or something like that from A to B. And that's all that a property graph really is. And that's the native storage format of graph databases. Now the nice thing here is, once you have your data in this format, you can make use of different languages which have been designed specifically to query these property graphs. There are two languages which are currently very popular. One is the Cypher query language by the Neo4j guys. This is not so well suited for what we want to do, most of all because Cypher doesn't allow you to have the equivalent of stored procedures, and we really want to have this, as we'll see. The other one is Gremlin. And Gremlin is really awesome, as I hope to show. A typical Gremlin query looks a bit like this: you choose a set of start nodes in your graph, and then you describe where to walk the graph from there. And if that walk is possible, then the nodes that are reached in that way will be returned. So as an example: return all objects created by people over the age of 30. Here's the Gremlin query.
You begin by saying: take all nodes where age is bigger than 30. Here in the example, that's Josh, who is 32, and Peter, who is 35. Now take all outgoing edges labeled created, and then you reach this node and this node. And now the nice thing is, once you've created a walk like this, you can give it a name, right? So instead of always saying filter age bigger than 30, out created, you can save it as, say, peopleWeCanFireNow, and then you can reuse it again and again. Yeah, and this is what a definition like that looks like. It's very similar to a stored procedure. It means you don't ever have to write it again, and this is what we're going to make use of. Now, of course, for social networks, this is extremely useful, and the best example is Facebook Graph Search. They actually made a UI where people can click together different kinds of graph database queries. So you could look for people who like the English Defence League and curry, for example, or you could look for mothers of Catholics from Italy who like Durex, and stuff like that, right? But the amazing thing is, you can store other useful things in graph databases, right? And the code property graph, by definition, really is a property graph, so you can store it in a graph database. And now we can use this trick, taking walks in this graph and giving them names, to define a domain-specific query language for vulnerability discovery. That's the idea. And we did all this and made it open source so you can play with it. The tool is called Joern, the robust code analysis platform for C and C++. Don't be fooled, it's mostly C. But if you throw something like Firefox at it, which is, let's say, nice C++, then it will also work for you. If you throw something like the STL at it, it will fail miserably. Anyway, by definition, you get an extensible query language, because you have Gremlin, right?
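The filter-then-walk pattern and the named, reusable step can be sketched in pure Python. The toy data mirrors the classic Gremlin demo graph from the talk (Josh is 32, Peter is 35); the project names and the function name are illustrative:

```python
# Nodes with key-value properties, plus labeled 'created' edges.
people = {
    "marko": {"age": 29}, "josh": {"age": 32}, "peter": {"age": 35},
}
created = {  # person -> objects they created
    "marko": ["lop"], "josh": ["ripple", "lop"], "peter": ["lop"],
}

def older_than_30_created(graph=people):
    """The named walk: filter nodes by age > 30, then follow 'created' edges."""
    results = []
    for name, props in graph.items():
        if props["age"] > 30:
            results.extend(created.get(name, []))
    return results

# Once defined, the walk is reused by name, like a stored procedure:
assert sorted(set(older_than_30_created())) == ["lop", "ripple"]
```

In Gremlin proper this would be something like `g.V.filter{it.age > 30}.out('created')`, wrapped in a user-defined step; the Python version just shows the mechanics.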
And we provided a lot of different traversals, these so-called steps or pipes, that you can immediately use. It's scriptable via Python, because so many people like Python. And I also wrote a couple of shell utilities that you can use for day-to-day auditing work. And it's known to work on other people's machines, which I'm very, very proud of. Okay. You can download it here. So let's take a quick look at what this tool looks like. You have an importer, and you start by importing code. That's here. This creates a database directory called .joernIndex, and this is then stored in the graph database. Then all you need to do is start the database server, and it will make a REST API accessible. This shows that this kind of stuff comes from web guys, right? Now, to make sure that you don't ever have to see that there's web stuff underneath, I wrote a library called python-joern that you can just include in your scripts, and it will communicate with the REST API for you. And then you can run your scripts happily. So this is a complete working script, except for the query; here you insert your query. You connect to the database, you run the query, you print the results. That's all. Alternatively, you can use the shell utilities. Now, let's test this on some real projects. The first project we're looking at is the VLC media player. And just a short disclaimer about this: if you audit people's code, that doesn't mean that you hate them, and it also doesn't mean that you disrespect them. Mostly, it means that their code base is interesting in one way or another. The fact that VLC is popular made me want to audit it. I think the VLC developers are doing a really good job, and all of the bugs that are going to be disclosed now have actually been fixed in the git version and will probably be fixed in the next releases as well. Okay. So let's make this practical. You run the importer.
Joern starts importing the VLC code. You start the database server, and you can then point your browser to port 7474, and you get some basic statistics of what's inside the database. You can now see that for VLC, it's created about 2 million nodes, 5 million properties, 4 million relationships, and so on, and it uses about 705 megabytes. So it's doable; all of these experiments are being done on a laptop, so you don't need a server farm to operate this kind of stuff. Now, if you look at .joernIndex, the database that has been created, you'll see that there's actually an Apache Lucene index in here. Apache Lucene is actually a document database, and it's used as part of the graph database to give you fast lookups of nodes by their properties. So you can say things like: give me all calls to malloc, and you'll immediately have them. This takes up about as much space as the graph database itself, you could say. Now, let's start using this. Here's a very simple query with the Lucene query language. As I said, there are shell utilities; joern-lookup is one of them, and you can quickly pipe a query into it and get the results. Here it says: give me all the files where the file path contains the word demux, so all demuxers. Of course, you could have also done this with find, right? But now it gets more interesting. You can also insert Gremlin queries. Here I've created a custom step known as queryNodeIndex, and you can pass the Lucene query in there to get start nodes. But instead of just returning those nodes, we can now start traversing the graph from them. We can say things like: from those file nodes, go out and filter all the functions. This will give you all the functions, and we print their names. So suddenly, you're going from files to functions, and you get all the functions in files named demux.
Yeah, that's what that looks like. Now, let's see how much time I have left. Okay. So of course, you're going to be asking: okay, nice, but how do I know what's actually in the database? How do I get from one node to another? And the nice thing is, if you work at a university, you eventually have some master's students, and one of our master's students wrote a tool to visualize the graph database contents. This is, of course, printed in a way that you probably can't read from the back. The point is just that in those nodes, you see all the properties, you see the different edges between them, and you also have the labels and attributes on those. So you can use these tools to plot, for example, here we say: give us the data flow edges and the control flow edges, and it will print nice things that you can put onto slides, or study to actually learn how you can get from one place to another. Now, let's finally take a look at bugs. I first wanted to look at things which are in a way similar to the libssh bug that Stefan had disclosed. I usually begin by formulating in words what I'm looking for. In this case, this was: get calls to malloc where the first argument contains an additive expression, and a call to memcpy is reached by data flow, where the third argument, the amount to copy, also contains an additive expression, and the two additive expressions are not equal. These are the kinds of things that you can easily find with the tools that we wrote. And as a query, it looks like this. This is a custom step, getCallsTo, and you'll find a lot of different ones; there's, for example, getDefinitions, getFunctionsByName, stuff like that. This gives you the start nodes: get calls to malloc. From there, walk to the first argument; it's argument zero, because we're computer scientists and we start counting at zero. And as a side effect, store the code that you see there.
That's the first argument; store it in a variable called cnt. Now, from there, look for an additive expression. As I said, if there's no additive expression, an empty set will be returned here, so those will be filtered out. And then, from all those mallocs where there's an additive expression in the first argument, go up to the statement that encloses this, and from there follow data flow links, right? To calls to memcpy where the third argument is not equal to this cnt that we just stored up here, and where there's an additive expression inside. And that's it. I can imagine that if you see something like this for the first time, it looks really complicated, but you'll get very fluent at this kind of stuff. Then we just pipe it into joern-lookup, we sort it unique, we get the locations with joern-location, and we have a file containing locations that match. Catting that file gives us four hits. And I immediately said: oh, wow, okay, MP4 looks nice, I have some MP4s, so let's look at the MP4 stuff. Then you can pipe it into joern-code, which will give you the code. And this is what you find: a function called MP4_ReadBox_name. Well, I would have asked you how to trigger the bug, but I wrote it in the title of the slide. It looks very similar to the bug that Stefan found. Here, inside the malloc, there's an addition: you add one, you subtract eight, so you're subtracting seven, really. And down here in the memcpy, you're subtracting eight. So now, if you insert a seven, then seven plus one minus eight gives you zero, so you allocate something close to zero bytes. And here, if you insert seven, seven minus eight is minus one; cast to an unsigned integer, that's the maximum unsigned value. And there you go: you have your buffer overflow. This is in the MP4 demuxer. Let's see. Yeah, okay.
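The arithmetic of this mismatch between the allocation expression and the copy expression can be sketched in Python, with the 32-bit wraparound made explicit. The function names and the mask are illustrative, not VLC's code:

```python
# Allocation computes size + 1 - 8, the copy computes size - 8: the two
# additive expressions differ, which is exactly what the query matched.
M32 = 0xFFFFFFFF

def alloc_size(box_size):        # models malloc(box_size + 1 - 8)
    return (box_size + 1 - 8) & M32

def copy_size(box_size):         # models memcpy(..., box_size - 8)
    return (box_size - 8) & M32

# box_size == 7: a zero-byte allocation, but a copy of UINT32_MAX bytes (~4 GiB)
assert alloc_size(7) == 0
assert copy_size(7) == M32
```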
Now, an interesting question to ask here is: how do you come up with things to search for in a code base in order to actually find bugs? There were some wise words in The Art of Software Security Assessment, and I'd like to quote them: "In fact, two of our authors typically start reviewing a new code base by finding the equivalent of the util directory and reading the framework and glue code line by line." And I found this really helpful. If you start auditing a new project, take a look at the language that's being spoken in this code base, and those are the utility functions. So in the case of VLC, I took a look at vlc_stream.h. This is the API that they use to analyze data streams of all kinds. If we find fundamental difficulties in using this API, then we're going to find bugs. Now, what I found is that there's a function called stream_Size, and as you might have suspected, it returns a 64-bit integer. That's because a data stream, which might be just an MP4, of course should be allowed to be bigger than 4 gigabytes, right? Any good HD movie is going to be bigger than 4 gigabytes, something like that. So the problem here is that on 32-bit platforms, you can't really store a 64-bit value in a register, and the size of the stream can be larger than the maximum size_t on the platform. That means all allocations that depend on stream_Size are something you're going to have to be very careful with. And it's a realistic attack, because if somebody downloads a movie from a file sharing application and the description is right, then people won't be thinking: oh, this is 4 gigabytes, I'm not going to start this. It's 4 gigabytes because it's a movie. There you go. So again, formulating this in your head, you get: give me statements containing calls to stream_Size and the symbol int64_t, where data flow exists to statements containing the symbol malloc, right?
So you get allocations depending on the stream size. And this query, well, the sentence is not so long and the query isn't either. It's, again, getCallsTo, which you just saw, stream_Size, and then go up to the statement; now you're leaving the AST part. Then filter all those statements which contain int64_t, and follow the data flow to statements that contain the word malloc. That's it. You don't get so many hits; those are all the hits that you get. They're mostly not so interesting, but this one is nice, because this is the updater: a read size of plus one depending on the stream size. So let's take a look at this code. This is the VLC automatic updater, the thing that checks in the background whether new updates are available. You can see it connects to some URL, we'll see which one, and then it calls stream_Size and stores the result in i_read, which is a 64-bit integer. Then it calls malloc on i_read plus one. Now, on a 32-bit platform, i_read plus one will first succeed; they'll most probably do the calculation in 64 bits. But then, no matter what happens, this is going to be truncated to 32 bits, right? Because size_t on a 32-bit platform is 4 bytes as opposed to 8. Now, if i_read is the maximum unsigned 32-bit value, then, again, you get a buffer with a size close to zero. And then they call stream_Read, which is an API function that is similar to memcpy in a way, but much better, as you'll see in a moment, and they read i_read characters into this buffer. So again, you have an overflow. Now you're going to say: yeah, but the update is going to be verified, right? Here's the nice thing. They actually do signature checking after they've downloaded, using HTTP only. But this buffer overflow happens before the signature is checked. Great. So it's executed before verification, nice. This could be something that we could be interested in.
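The truncation step can be sketched in Python. The variable name follows the transcript (i_read); the masks stand in for a 64-bit intermediate result being narrowed to a 32-bit size_t, so this is a model of the effect, not VLC's code:

```python
# stream_Size() returns a 64-bit length; on a 32-bit platform malloc takes
# a 32-bit size_t, so i_read + 1 is silently truncated before allocation.
M32 = 0xFFFFFFFF
M64 = 0xFFFFFFFFFFFFFFFF

def malloc_arg_32bit(i_read):
    """Value malloc actually receives: 64-bit addition, then 32-bit truncation."""
    return ((i_read + 1) & M64) & M32

assert malloc_arg_32bit(1000) == 1001      # benign download
assert malloc_arg_32bit(M32) == 0          # a ~4 GiB file yields a 0-byte buffer
```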
Now let's look into the binary, and we see, okay, this is a static URL, update.videolan.org and so on. And this here, if you follow the link, you'll see this is the call to stream_Size; it's been inlined. And here you see the truncation in a very obvious form. The return value is actually stored in two registers, because you can't really do it differently on a 32-bit platform, right? But then, as they add one, they really just add one to one of the registers, because they're going to be using a 32-bit value anyway, so they might as well truncate it here. And then this flows into the malloc, and there you go: you have your integer overflow leading to a heap-based buffer overflow. Now, I set up my little web server, pointed update.videolan.org to that web server in /etc/hosts, and created a file that is four gigabytes long and contains all A's. I attached a debugger, and I checked for updates. And here's what happened. This is very convenient, because typically from the crash you need to do all sorts of work to actually get EIP equal to your value, right? But this is 41, 41, 41, 41, meaning it's all A's. So you control where the program jumps next. And the nice thing is, if you have this, then arguing in favor of patching gets very much easier. Okay, so some notes on exploitation. stream_Read is your friend. I said this already: it's like memcpy, it copies data into your buffer, but much better. First of all, the size depends on the Content-Length field, so you don't actually have to transfer four gigabytes of data. All you need to say is: I am eventually going to transmit four gigabytes of data, and here's some data. The attacker fully controls the amount of data to actually copy, and you also control how the data is fragmented into different blocks. And now the nicest thing of all, and this is why we control EIP here: in between those copy operations, stream_Read dereferences a function pointer. Okay?
So if you overwrite this function pointer, then you control EIP. Now, ASLR and DEP are enabled, but there is a bit of position-dependent code that we can use to build a small ROP chain for the exploit. The downside is that this is not so stable, at least for my POC exploit. So I'm going to show you a proof-of-concept exploit now; I only took a couple of days for this, so it's not very stable, but it will work at some point. Let's see. Okay. So this is the platform. Now I start VLC, and then I use the automatic updater. And it didn't work. And I tried again, and it didn't work. Also, POC exploits have stage fright, as you know. And it didn't work. Okay, give me five tries. Almost. That's very close. Just a second. Yes, there we go. Yes. The reason this didn't work straight away is that I didn't do all sorts of work to ensure that the heap is in any controllable state. These were really just a couple of lucky shots, and eventually it hit the right address. I mean, it shows that you can execute arbitrary code, so this has to be enough to get it patched, right? Good. Next. So our second test subject was the Linux kernel. This is also well suited for static analysis, because there are a lot of drivers in there, and to actually fuzz these drivers, you would need to have the hardware. So you find a lot of things statically as well. And also, it's a large code base with a lot of users, so it's interesting. All right. Now, here I want to show you something, and I know this is a lot of text on the slide. I want to show you that we can define very complex traversals and then just hide them in a single word, which in this case is unsanitized, and then reuse them again and again. For unsanitized, what I want is to be able to find cases where I have a flow from a source to a sink, but certain checks are definitely not in place. All right.
So if you formulate this in complete form: the traversal returns attacker-controlled sources if and only if they meet the following two conditions. First, there exists a path from the source statement to the sink statement in the control flow graph such that no code on the path matches any of the sanitizer descriptions; these descriptions are something that you pass in. Second, a variable defined by the source and used by the sink reaches the sink via the control flow path, i.e., the variable is not redefined by any node on the path. That's pretty complex, right? But you can implement it once, put it into your step called unsanitized, and reuse it again and again. And that's what we did here. So here you see a typical, let's say, taint-style query. Here we look for buffer overflows in write handlers. We start off by characterizing the sinks that we want to reach, the sensitive operations: get functions where the name matches write, okay? And from there, go to calls to copy_from_user or memcpy and to their third arguments. Now filter those for count, right? Because count is typically a variable you control in a write handler. And then here comes the unsanitized step; that's the important part. In the unsanitized step, you characterize the different kinds of checks that you want to filter on. If this stuff occurs, then you don't care about this sink. One of them is: is there some sort of comparison in here? We call this isCheck. Or does the code contain alloc? That means the buffer is probably allocated to have the correct size. Of course, we don't know this, but let's filter all of these; let's look for the cases where it's very clear that this is a bug. And min is not used; min is often used instead of having a check, so filter those as well. And make sure that the sources match count.
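The two conditions can be sketched as a toy implementation over a small control flow graph in Python. Everything here, the graph shape, the statement strings, and the string-matching predicates, is illustrative; Joern's actual traversal works on the code property graph, not on text:

```python
# Condition 1: some source-to-sink CFG path matches no sanitizer description.
# Condition 2: the tainted variable is not redefined along that path.

cfg = {  # adjacency: statement id -> successors
    "src": ["a"], "a": ["b", "c"], "b": ["sink"], "c": ["sink"], "sink": [],
}
code = {
    "src": "copy_from_user(&len, arg, 4)",
    "a": "log(len)",
    "b": "if (len > MAX)",          # a check: this path is sanitized
    "c": "count = len",             # no check, no redefinition of len
    "sink": "memcpy(buf, p, len)",
}

def unsanitized(start, end, var, sanitizers):
    """True if an unchecked, redefinition-free CFG path from start to end exists."""
    def dfs(node, path):
        if node == end:
            return True
        for nxt in cfg[node]:
            stmt = code[nxt]
            if nxt != end and any(s in stmt for s in sanitizers):
                continue                          # path contains a check: prune
            if nxt != end and stmt.startswith(var + " ="):
                continue                          # variable redefined: prune
            if nxt not in path and dfs(nxt, path | {nxt}):
                return True
        return False
    return dfs(start, {start})

# 'len' reaches the memcpy via c without passing any check:
assert unsanitized("src", "sink", "len", sanitizers=["if ("]) is True
# If every route is covered by a sanitizer description, the source is filtered:
assert unsanitized("src", "sink", "len", sanitizers=["if (", "count ="]) is False
```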
And we have a similar one. We're not going to go through this in detail, but again, the unsanitized part is exactly the same; all we do now is give a slightly different characterization of the sinks and the sources. And we ran these two queries, and, well, we had seven 0-days and 11 hits using this. So that was very cool. Now here's one of them, just to show you what these kinds of bugs look like and what you can use this unsanitized step for. So this is an IOCTL handler of QF, that's an Ethernet card. You see here a call to copy_from_user, which is being used to initialize the variable recLen, so requestLen. That means this variable is controlled by user space. Now, there's an allocation here on the way, but it's absolutely not concerned with recLen. Instead, recLen is next used in this copy operation down here, as the third argument to memcpy. And the size of this buffer has nothing to do with recLen — it's a static buffer. So you have a classical stack-based buffer overflow. And the nice thing — this is why there were only 11 hits — is that we were able to filter out those cases where there are different kinds of checks in place to ensure that this kind of stuff doesn't happen. So, a stack-based buffer overflow; we did not write an exploit, but it looks pretty exploitable.

Now, for the practical evaluation of all this stuff — I was only able to show you a couple of samples here — we did a larger evaluation, and Nico worked on this a great deal. He used this in a real company, at Qualcomm, and found about 100 issues in an internal audit. Now, I don't know exactly what an issue is here, so I can't comment on the quality of those findings. But then we also did an evaluation — he did most of the work on this — on the Linux kernel mainline. And we formulated different kinds of queries: two queries for buffer overflows, you just saw them, although they were slightly rewritten.
Then we had zero-byte allocations, memory mapping bugs, and some memory disclosure bugs. That one was rather complicated, and I need to publish it at some point. Yeah, and we found a lot of bugs. Kind of proud of that. So, yeah. And they were also acknowledged by the developers. It's always nice to know that these are not just bugs for security folks — the developers would also say they're real. Ten minutes? Yeah, that's good. Okay, so, conclusion. I showed you a system to mine code bases for vulnerabilities. Here, I've already built a bridge between program analysis and graph databases; in the long run, the larger effort is: can we use typical pattern recognition and data mining techniques to discover vulnerabilities? And this is one part of that. And finally, as I just said, we found real exploitable bugs. So, yeah, it seems to work in some cases. Thanks.

Okay, thank you very much. As usual, we have some time for Q&A left. We again have four microphones in the room. All the people in the room who have questions, please line up behind the microphones. And for the people on the stream, we also have a signal angel in the room who takes your questions on Twitter, IRC, and all your online channels. And we're going to go with the first question from the room, for microphone number two, please.

Yeah. So, do you have a database with queries ready? I wonder whether it would be a good idea to integrate this into a continuous integration system. So, do you have a database ready with such queries? And is there any way to annotate code? So if I go and check what this reports as a problem, can I annotate it in some way so that it will skip it the next time it goes through?

We currently don't have a database of, like, a lot of queries. We have some examples, right, that you can look at. But we definitely want to have this. This is more a question of the time that you're able to invest into this.
But I also think — I mean, essentially what you're saying, if I understand correctly, is: can we not build a database of things that we know have broken in the past, so that we can immediately scan new code for this kind of stuff? Yeah, this would definitely be very awesome, but it needs some work to be done, just writing the queries.

And the annotation part — I mean, can I mark some code as checked and safe, so it doesn't go through it again? Yeah, I have something interesting to say about this, but I'd rather take it offline, because it concerns work that we haven't published yet. But yeah, about the annotation — maybe we can discuss this after the talk. Thank you.

Okay, next question, for microphone number three, please.

Yeah. Given that you support C and C++, I assume you're obtaining the AST from the Clang compiler, for example?

No. Yeah, this is another story, but I was asked by several people to skip it because it's a bit boring. The thing is — I said something about a robust parser, right? The problem really is that the job of compilers is also to check whether the code that you're giving them is actually correct. That means, in C, that you need the complete code base and a working build environment with all the libraries and stuff like that. I wanted something that's more like a search engine. So I actually built a parser for C that does robust parsing. It assumes that the code you give it is valid C++ in some weird dialect used some place on earth, right? And then it will just parse the stuff that it understands, and the stuff that it doesn't understand, it throws away. That's part of the project. Yeah, the problem that you have with libclang — I mean, you could definitely write an interface for this.
But the problem is, typically, as soon as it sees some code it doesn't know — like an unknown definition or so — the rest of the source file is not parsed, and we don't want that.

Okay, another question from number three, please.

Hello. Are there any errors that one would intuitively think one should be able to find using this method, but which are not possible, or too hard, too expensive? Do you have any interesting examples of that?

Well, in a way, if you take a look at this, this is actually pretty intuitive, and I would have thought it would have been found a long time ago. I think what's problematic here is maybe that you can only trigger the bug if you insert a seven. So if you take a fuzzer to find these things — and I think most people fuzz this stuff — then you're probably not going to hit the seven. But in a way, I think this looks so intuitive, and we still see that using these simple methods we can find bugs. So I don't know if that answers your question. Probably not.

Yeah, I think you answered it the other way around — that's things the system is good at. I was asking for things that the system is surprisingly bad at.

Surprisingly bad at. Well, I mean, there are very obvious limitations, right? We're not really evaluating the expressions at all. As I said in the beginning, it's more that you try to model what people do who look at code manually — look for additions inside allocations, or stuff like that. But we don't really evaluate the expression to see — yeah, that's still very abstract; I can't name anything concrete. But I mean, that's the biggest problem, especially if you compare it to exact methods such as symbolic execution, where you actually calculate what the real values could be, which we don't do here.

Thank you. Okay, next question, from microphone number two, please.

Oh, that's okay, I can just put my hand up. Thank you. I'm wondering — we have now seen that you evaluate code being C or C++ — well, or you said better C than C++.
Was it that you kind of used a grammar to parse, when you said you wrote your parser? Because I'm mainly working with Java, and I'm interested in searching for bad coding and possible bugs in Java.

Yes. So we wrote a grammar for the ANTLR4 parser generator, which is a really awesome tool. If you compare it to stuff like Flex and Bison — I mean, with Flex and Bison you spend lots and lots of time making them understand what you want to parse; with ANTLR it's really simple. And yeah, we've also had somebody take a look at adapting this to Java. There are people actually using this for Java right now, but clearly, well, you're feeding Java to a C parser. So most data flow inside a function, for example, works very well. But if you want to take a look at class hierarchies and stuff like that, it would probably be a good idea to just adapt the parser to Java.

Okay. A question for microphone number three.

Can you say something about performance? Do you need to tweak the queries to make them fast, or is it okay?

Yeah, so it depends a lot on the number of start nodes that you choose. Because if you have a lot of start nodes, what you need to do — and this is also in the documentation — is to start chunking them so that it can run them all in parallel; and because at some point you may want to merge the results again, you need to split this yourself. Of the queries we presented here, the slowest took about five minutes to execute on my laptop. So I guess that's acceptable if you're doing an audit. The part that takes most of the time was this unsanitized step, where you actually start traversing the control flow graph again and again and again, to see if there's any path you're interested in that's reachable. For simpler stuff — like this bug here, for example, or also the updater bug — it's pretty instantaneous.
Well, maybe not instantaneous — it takes 10 to 15 seconds or so.

Cool. Okay, I see we don't have any more questions from the room. Is there maybe a question from the internet? Okay, no. So please, thank you again, fabs, for your very interesting talk.