All right, hello everyone. The agenda says I was supposed to be joined by my colleague Fabian, but he couldn't come for logistical reasons, so if you came here just to listen to him, you will be disappointed. My talk is called "LLVM meets Code Property Graphs", and today I want to talk about what code property graphs are and their applications in the security and software analysis domain. A few words about me: if you want to find out more, there is a blog, I'm on Twitter, and you can also drop me an email if you have any questions. I work for a company called ShiftLeft Security, where we build tools for custom static code analysis. My responsibility there is to make them work with LLVM so that we can handle languages like C, C++, and more.

Before we start, I want to ask you some questions. Can you see a problem with this code? Anyone? Just shout. Yes, double free, nice; that took about four seconds. Can you see a problem with this code? Use after free, yes, that was even faster. But can you see a problem with this code? Apparently you can, if you have enough time and stamina to spend ages analyzing it manually, but that's not practical. So we have to use tooling and gently ask machines to help us with this task.

Let me show you a demo. This is the basic example, the first one, the double free. How do you find it with tooling? First you need to load the program into memory so that you can run queries against it. Then it's actually very easy, because it is an easy example: from a data-flow perspective, to find a double free we just need to find a flow from one free call to another where the parameter is the same.
So we have a source, which is the parameter of free, and we also have a sink, which is also a parameter of free. Now, if I run "show me all the flows from sink to source"... it doesn't work, because I forgot one thing: allFlows. Here we are. It says that the first call is at line six of the file and the second one at line eight. You can run more sophisticated queries than that, but we'll get to it in a moment.

This tool is built on the idea of code property graphs, and before going further I have to take a step back and talk about property graphs in general. I assume you all know what a graph is; if not, you have a set of nodes, and those nodes are connected by edges. A property graph is an extension of that, also called a multigraph: two nodes may be connected by several edges, and both edges and nodes may carry a number of properties, basically key-value pairs. The CPG, the code property graph, takes several representations of the program and combines them: the AST, the control-flow graph, and the program dependence graph. In the end you have this multigraph, in memory or not, and there are tools like Neo4j or one of the Apache graph projects that let you work with such graphs. You can run queries, just as you would write "select ... from ..." in SQL, and this is what happens in this example: we run queries against the database. The beautiful part of the code property graph, or at least of this implementation, is that it's a multi-layered thing.
In this example, at the very lowest level you have just the raw database representation, and as a human being it's not nice to work with, because you need to write a lot of boilerplate for simple things. So for the code property graph there is another level that adds syntactic sugar, so that you can run simple queries and find more information. There are further overlays on top of that, and they are domain-specific. For example, in our product there are overlays for web applications: you load the CPG and run a query like "show me all SQL injections", and it shows the flows from an API route to the place in the code where the SQL injection may happen and be exploited by the user.

If you're familiar with LLVM or Clang and their architecture, you will recognize this slide: the CPG is just another intermediate representation of the program. There are a number of front ends, one per language, that emit the CPG, and the CPG is then consumed by the back end. Originally we were using Neo4j, the graph database, but it's too general, it doesn't really fit our needs, and it was too slow: to analyze big projects you have to load the whole database into memory, and for some projects that may require tens or hundreds of gigabytes. OverflowDB is a graph database that stores the same information but loads it lazily, and when it doesn't need something it can overflow it to disk. So if you don't have enough memory on your machine you can still use it, but you pay with performance because of the swapping to disk.
The command-line tool I showed is called Ocular, and it has an open-source counterpart called Joern. Joern mainly targets C and C++ and uses the fuzzy C front end; "fuzzy" means it's not a real compiler front end but, roughly, a set of regular expressions. It's more complicated than that, but that's not important here. Ocular is a commercial tool, very similar to Joern in its capabilities, but with more overlays for domain-specific things like enterprise applications.

My goal today is to talk about llvm2cpg. The project was born because some of our customers and clients use the fuzzy C front end, but it's not good enough when it comes to C++, or when you want a more precise picture of a program. It does have advantages: you don't have to compile the code, you can just run it on source files and it will work. Still, there were many requests for better C and C++ support, and this is how llvm2cpg started. The question I always get when I talk about this is: why llvm2cpg and not clang2cpg? It's obvious that you can get much better information from the source code, and by going down to the LLVM level you lose some of it. We acknowledge that, but we were also curious about the other advantages and disadvantages. One is integration: it's arguable, but in my opinion it's much easier to get LLVM bitcode than to connect a Clang tool to an archaic build system or, say, an Xcode project on macOS. The other part is language support: for instance, on macOS it's very common that Objective-C code is mixed with Swift.
For that you couldn't use just Clang; we would have to combine the Swift compiler and Clang and somehow make them cooperate. With LLVM as the baseline, we can handle multiple source languages in one place. There is also the language feature set: C, C++, and Objective-C combined are just huge. We could have built source-level support for all of that, but it would probably take several years; at the LLVM level the surface is much smaller, which means the time to market is much shorter. These are the conclusions; some of them are arguable, but that's the current state. The strategy was to take LLVM bitcode as the baseline and then be driven by feedback: customers say "we don't see something in our Objective-C code", and we add support for it.

Previously I mentioned that part of the CPG is the AST. We don't really have an AST at the LLVM level, but we can mimic it. On the left there is a very simple function, identity, that just takes a parameter and returns it, and on the right you can see the unoptimized LLVM bitcode that does the same. First we made the assumption that we would treat every LLVM instruction as a function call. That's not very useful, especially for the human beings using the tool, because raw loads and stores just don't make sense to them; it's basically working at the assembly level, which is not as nice. So we made further assumptions and specialized some of the instructions: load and store become assignments with indirection, so that in the end, from this bitcode, we could build the AST that looks like this. It's not a one-to-one mapping, just an example.
With this AST we could then build the CFG connections, and with that the graph is complete: we can run queries and find some nice properties. As I mentioned, we acknowledge that we lose information, but we were curious what kind of information exactly. The first part is about types, because this is probably the most interesting part when you do analysis as a security person. What I've observed is that people look for things like: find all members named "length", check whether this length can be controlled by something, and whether it participates in malloc calls, because then you can probably exploit the code. The names of the struct types themselves are mostly in place and available, but we definitely lose the names of the members, because to LLVM they just don't matter; they are indices: zero, one, two, whatever. We can still look at the debug information, if it's present, and attach the right member names to structs. But that's certainly not possible for unions: at the source level a union here has two members, but at the LLVM level it is a single field, and from a getelementptr call with index zero we cannot know whether the code was using A or B in this example.

The other part, a bit surprising though not unexpected, is that the ABI plays a huge role in this conversion. At the source level we have a function that takes a struct as a parameter and returns a struct. If, at the CPG level, you run a query like "find all the functions that return a struct named Color", you won't find anything, because at the LLVM level the function returns void and instead takes two structs as parameters.
It gets even worse with other types. In this example all the fields are doubles, and the struct is passed around as a single thing; but if we change them to integers then, again because of the ABI of this specific platform, you don't even have structs anymore: there are just two numbers, and they are decoded by the machine afterwards.

Another surprise was constants, and this is actually by far the biggest problem. If in the function main we locally define a constant that holds a number of pointers to functions, at the LLVM level it becomes a global variable. Global variables are not very nice for data-flow tracking and code analysis: intra-procedural analysis stays small because you have just one function, but if you also have to analyze global variables, you must figure out who may write to them, who may read from them, whether there are aliases, and so on. The search space explodes, and that gives us a really hard time getting it right. As a workaround, so far we are whitelisting edge cases like this one and putting the constants back into the function as locals, but that's not a universal solution, unfortunately.

Here is another of the big problems, one I faced just several weeks ago. In source code you can define a type, say a struct, in a header, then include that header into two different compilation units and compile them separately. At the LLVM level you then have two bitcode files that both contain a type like struct point: struct point in one module and struct point in the other.
If you load them into the same context, LLVM must rename those types to avoid collisions, because types are stored in what is essentially a key-value map, and apparently there is no good solution: we want to de-duplicate these types, but there is no way to do it, at least currently, using LLVM's built-in facilities. Here is one example of why it doesn't work: in LLVM, types are considered equal if they point to the same object, since all types are allocated on the heap. So if you have two different contexts and ask each for a float, they are the same type from our point of view, but LLVM considers them different.

The first attempt to solve this was the function isLayoutIdentical: you run it on two structs and it says yes or no. But it doesn't behave as you'd expect. The structs on the left have identical layout and the answer is correct, but on the right LLVM does not consider them identical, because this function only checks that all the members have the same types. In my opinion it's under-implemented for the base case; either it should not behave like that, or it should be renamed to something like isMembersIdentical, which would be more accurate. We discovered this rather late in the implementation and had to find another solution.

The next solution, which also didn't work well, was the IRLinker; there is also a command-line tool for it called llvm-link. It tries to combine two modules and eliminate as many duplicate types as possible. In the first case it works perfectly: just one type, point, is left, and we can use it. But in the second case, where the user sees two different types, point and tuple, that happen to have the same layout, one of them gets eliminated, and we lose even more information.
So I would suggest that you don't use IRLinker if you want to preserve type information, or at least be careful, because it can also mangle the types so that a type appears that never existed at the source level. Of course you should not rely on this information in the first place, but still, we want to rely on it.

Our attempt to solve the type de-duplication problem was inspired by tree automata and ranked alphabets. A ranked alphabet is a set of symbols where each symbol has an arity, a number of parameters. You can think of the symbols as functions; you can combine those functions to form trees, which is where tree automata come in, and each tree can be represented as a single string. So our task is to encode the types into such trees, convert the trees into strings, and then compare the strings.

The next part I wanted to show you I tried to put on a slide, but it just doesn't fit, so I'll explain it briefly. It's about Objective-C. Recovering the information for Objective-C was actually the easiest part, because it's all there in the bitcode. But there are some quirks. I should have found a nice picture for this, but I didn't, so I'll use this example. In Objective-C, each class has a superclass and a metaclass; that's by definition. The metaclass of a metaclass is the root metaclass, and the superclass of a metaclass is either some other metaclass or, at the root, the class that starts the hierarchy in the first place. So it's all very intertwined. I just wanted to mention that part.

I think I went faster than I expected and probably missed something, so let me say what I skipped earlier. One of the big issues is that the CPG has two kinds of users.
One is the explicit user, a human being sitting there and running queries. The other user is the data-flow tracker. And here we just don't have the right answer, because if you optimize the bitcode and load it into the CPG, you get a much smaller graph and the data-flow tracker is much happier, but as a user you again lose information and cannot work with it. This is one of our dilemmas.

Our next steps for this project are basically to make it stable and to add more features from other languages: Rust, Swift, Fortran. The other part, which we have started but aren't done with, is to hook back into Clang, into the AST, and extend the CPG representation we build from the bitcode, so that we don't have to rely on debug information. Debug info is not as nice; it doesn't always work, especially across different versions of LLVM and different compilers. We also want to analyze more projects and get some results: in the original work on code property graphs, Fabian, the colleague who couldn't join today, analyzed the Linux kernel and found vulnerabilities that were not known before. We want to try the same with other projects.

Some resources, if you're interested: I recommend you check out Joern, which is free and open source. There is also Ocular; it's a commercial product, but maybe it's interesting for your company. Oh, and I did not talk about type equality. I wrote this thing down, and I would be happy to contribute it back to LLVM, but I still have the feeling it's a bit too complicated and over-engineered.
I'm trying to get feedback from LLVM folks on it, without much success so far, so if you have comments, please be my guest. The last point is a link to another talk that is more general, about how to build tools on top of LLVM. It was given at EuroLLVM and is based on my experience building such tools; I decided not to cover the same material here and just refer to the other talk so I don't repeat myself. I think you get a longer break after this talk, so if you have any questions, I'd be happy to answer them. Yes, please.

[Audience: for a product with a given number of lines of code, say the Linux kernel, how much time does it take to generate the final CPG?]

So the question is how much time it takes to generate the CPG for a project. Generally it depends on the front end, but for llvm2cpg I can say: I tried it on Blender, which is quite big, though it wasn't the full project, and it took roughly 15 minutes on this machine. We also ran it on the macOS kernel, and that also took about 15 minutes. So it's reasonably fast, I would say.

[Audience: since you essentially know all the calls, could you take a project and query whether it uses the network at all, to verify that some random project pulled from the internet isn't sending anything?]

Yes. The question is whether there is a way to find such patterns, to mine properties of a program: does it send network requests, does it allocate memory, and so on. I think it boils down to two things, and I'll get back to this; this thing first. So this is the command-line interface.
It's very similar to the SQLite shell, and it's basically Scala: you can write Scala scripts that interface with the CPG, with the database, and run these queries. For example, and this is what our security team actually does: projects like kernels and drivers don't always call malloc directly; sometimes they have wrappers around it. So the team annotates the functions of interest manually and then runs queries over those functions. That's one part, so it's technically possible. There is no ready-made "does it make network requests" check right now, but it's technically possible to build one, and this is where the overlays come in. As I said, they're domain-specific: we have a number of overlays for the Go language, and for web applications specifically, for example "find all the routes, all the API endpoints, in this application".

[Audience: so if I annotate functions saying "this can put something into a buffer" and "this one is affected by global variable X", can it find the functions through which that global variable eventually puts something into the buffer?]

Yes, generally you can do that. It depends on the scale and size, but generally yes. I showed this example in the demo, and that is basically what happens here, in a very simplified form, the simplest possible example. Our security folks do some really crazy stuff with this that I'm not able to reproduce, but this is essentially what you do: you define one part, you define the other part, and then you find the connections between them. Does that make sense? Okay, thank you. Please.

[Audience: how do you handle variadic templates and similar features?]

That's the beauty of LLVM: we don't really have to take care of that ourselves. The question is how we handle variadic templates and such.
Because we work at the LLVM bitcode level, this information is already encoded into the signature, into the mangled name of the function or method. Currently all we do is demangle that name so that users see something reasonable. We have all the information in the string already, so we could parse it and rebuild the whole template structure, but no one has asked for that yet.

[Audience: how do you handle constant types?]

The question is how we handle constant types. So far I don't think we have any special treatment for them. A constant is normally a global variable, and there is no great support for globals in the CPG, because of the search-space explosion. It could be implemented; we just haven't gotten there. So far, in most cases where we are interested in global variables, they are either function pointers, because that helps with data-flow resolution, or constants such as string constants, so that a user can see which strings are used, and so on.

[Audience: is it possible to compare this with something like the Clang static analyzer, or is it so different that it's incomparable?]

You can compare them to a certain extent. In a nutshell, this tool is more user-driven, and I think its audience is security teams, red teams and blue teams, who analyze code. We don't have the same checks the Clang analyzer has, because the Clang analyzer already exists and you can just use it; there are other tools for that. But we could do it.

[Audience: why does LLVM work like this? Why doesn't it automatically de-duplicate the struct types? Is that a conscious design decision, or just an accident of history?]

Yes, so that question is more open, to the community:
why does LLVM not de-duplicate the types by default? I don't have an answer; I simply don't know. Sorry.

[Audience: the structure is super complicated; it's hard to have an algorithm that decides whether two structures are equal from a layout point of view. You've designed a solution that relies on graphs and layouts to recursively check whether two structs are the same, but there may be edge cases, unions or GCC and Clang extensions, where this won't work, because you're relying on obscure features that may not be decidable at compile time.]

I can still comment on that. First of all, in our case we don't really use layout equality to decide; we use structural equality: if you draw the structs as graphs, or as trees, they must have equal shape. That is our definition of equality. There is another approach, based, I think, on Hopcroft's DFA minimization algorithm. At least my colleague told me that this is what IRLinker does, or that's what it looked like to him, and I think there is also some Scheme project that uses this algorithm. But it relies on a different notion of equality that doesn't fit our definition, and that's why we do it the other way.

[Audience: does it treat a type and a pointer to that type as the same, or does it compare them?]

It doesn't treat pointers as the same; that's the point, and it's also how we avoid recursion. When we build the tree, we don't take names into account: we rename all the structs and give them numerical IDs. So if you see a struct you have already seen, the recursive case, you just emit "this is the same struct"
and don't emit anything further. So it really is a graph, and we convert the graph into a tree; there is no unbounded recursion in this case.

[Audience: pointers in LLVM still have pointee types, even though conceptually the idea is to move away from that. Once pointers are opaque, that problem just disappears; it becomes trivial, because you no longer have forward references, and then you could de-duplicate by default. It's historical that pointers have pointee types.]

Yes. One more comment on that: there is a write-up by Chris Lattner about the type system change around LLVM 3, with some good points on why the type system is the way it currently is. There were some questions here, and one over there, I think. Yes, please.

[Audience: when you get the bitcode it's optimized; how do you get back to the source code?]

So the question is: the bitcode is optimized, how do we get back to the source? It's not necessarily optimized; it's up to the users how to emit the bitcode, and it can be totally unoptimized. To get back to the source code, our only means is debug information: every instruction carrying debug info has a notion of where it came from, so we can reconstruct the interesting parts, look back at the source code and the AST, and see what's there. One last question, I think.

[Audience: why don't you use SSA form, rather than inventing something else?]

That's a very good question, and we have debates about it in the team once in a while. Basically: historical reasons. The tool was first built, I think, with Java in mind, and this is the format as it is. We have a number of front ends supporting different languages, C#, Python, Go, and such, and I think by now only LLVM gives us this pure SSA form.
So if we wanted to use SSA form, we would have to change all the other front ends as well, which is quite an effort. Okay. Thank you very much. Thank you.