So, hello everyone, I think we can start. My name is Andreas, and today I'm going to talk about code and graph technologies and show you how we can learn new things about code when treating it as a graph. First of all, I want to thank the organizers for having me here. I'm very excited, and I'm also excited to see that there's such a big interest in this topic.

First, a few words about me. I'm a physicist by training, and I started working on code quality some three years ago. I have a small spin-off of the University of Munich called QuantifiedCode, and what I do in my day job is think about code quality and how we can improve code. Most of what I show you in this presentation is the result of that work.

When I ask you to think about code, most of you would probably think of something like this. Except if you're a systems programmer, of course; then you would probably just see weird glyphs flying around and perceive the Matrix directly in your brain. But most people, when they think of code, think of something like this: code is a collection of text files that we open in an editor, that we edit, and that we share with others through version control systems like Git or Subversion. In this talk I want to show you that code can also be something like this: not text, but a graph.

Okay, so our journey in this talk will be like this. First,
I want to show you why graphs are interesting, then how we can store our code in a graph, then what we can learn from the graph, and finally how we as programmers can profit from that knowledge.

To get started, here's a 30-second introduction to everything you need to know about graphs. In this talk, when I talk about a graph, what I mean is a collection of nodes, sometimes called vertices, which I show as circles here and which are connected to each other through so-called edges. You can see an edge here. An edge always has a direction, as in this case: it goes from this node to this node, for example. It can also have a label, in this case it's classdef, and it can have some data associated with it. The node itself can also have some data, like the node type, for example, and other attributes associated with it.

Graphs are a pretty old idea. What has changed in the last years, though, is that we have a lot of new technologies and solutions for handling graphs and storing data in them. For example, we have databases like Neo4j, OrientDB, ArangoDB, or Titan that allow us to store very large graphs and perform queries over them. In this talk I want to keep the technology side a bit out of the picture and talk about graphs generically rather than about a specific technology.

In programming and compiler theory, graphs are nothing new either. They have been used there for a long time, mostly in interpreters and compilers, for various use cases such as code optimization, code annotation, rewriting of code, and as an intermediate language that the interpreter uses before generating, for example, machine code. What all those use cases have in common is that they are not intended for the end user. The graph is only used internally by the interpreter, not by us as programmers.
We're not supposed to interact with the graph in any way. In this talk, I want to show you what we can actually learn when we don't stick to this rule and actually do things with the code and with the graph that we generate from it.

Okay, so let's dive right into an example. Here's some Python code. It's a simple function that encodes a dictionary of values so that you can store them in a JSON file, for example. As you can see, it's just a for loop that checks the type of the elements in a dictionary. If it finds a complex value, it writes the imaginary and the real part separately into another dictionary, and then it recurses down, calling itself on the other values of the dictionary. So, pretty simple.

Now, if we want to generate the code graph for this, that is actually pretty easy in Python. The only thing we need to do is import the ast module and tell it to parse a string. We can just pass it the whole thing here, and as the output we get a data structure that looks like this. You can see that for every element in the code on the left, we have a node in the resulting graph on the right. For example, the function definition, which encompasses this whole code, is the first node in the graph. Then we have various other nodes, for example these ones here, that are related to the function definition through the body edge. They contain, for example, the assign statement that you see here, which in turn contains a name and a dictionary, and on the other side also a name, for example, that is called in this for loop.
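A minimal sketch of the parsing step described above, using only the standard `ast` module. The `encode` function is a stand-in reconstructed from the spoken description (the slide itself is not part of the transcript), and the helper `edges` is a hypothetical name introduced here for illustration.

```python
import ast

# Stand-in for the function on the slide: encodes complex values so that a
# dictionary can be stored as JSON, recursing into nested dictionaries.
source = '''
def encode(data):
    encoded = {}
    for key, value in data.items():
        if isinstance(value, complex):
            encoded[key] = {"real": value.real, "imag": value.imag}
        elif isinstance(value, dict):
            encoded[key] = encode(value)
        else:
            encoded[key] = value
    return encoded
'''

tree = ast.parse(source)   # Module node, the root of the syntax tree
func = tree.body[0]        # the FunctionDef node for encode()


def edges(node):
    """Yield (parent type, edge label, child type) for each parent-child edge."""
    for field, value in ast.iter_fields(node):
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, ast.AST):
                yield type(node).__name__, field, type(child).__name__
                yield from edges(child)


edge_list = list(edges(func))
for parent, label, child in edge_list[:8]:
    print(f"{parent} --{label}--> {child}")
```

The edge labels are simply the field names of the AST nodes, which is where the body edge mentioned above comes from.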
So, all right, this is basically it. We could in principle start working with this graph, thinking about it and doing things with it. But what we want now is to do this not only on a small scale but on a large scale, so we need some way to store this whole information in a database. Now I'm going to explain how we can do that.

For this we can make use of a trick that, for example, Git or Bitcoin use, which is based on so-called Merkle trees. Maybe you have wondered why Git can give you a very fast response when you have a project with, let's say, ten thousand or a hundred thousand files, you change some of them, and you make a commit. Git doesn't take hours to recalculate the state of the project but gives you an answer usually within a few milliseconds. The way Git does that is by treating the whole project as a tree, or graph, of hashes.

You can see an example of this here, taken from the Flask project. When you make a commit in Git, it has an associated snapshot tree, or commit tree. It has, for example, this uppermost node here, which contains the whole project and which also has a hash, namely the hash value of the given commit. This graph then contains several sub-nodes, for example for the flask directory or the docs directory, which in turn also have hashes. This continues until we finally reach the blobs, the files in this case, which also have an associated hash.

The Merkle hash works by starting at the lowermost nodes of the tree and working its way up, each time generating the hash of the given element from all the properties of this element as well as the hashes of all the elements below it in the tree. So, for example, this hash here depends not only on the value of this node here but also on the hash values of all the nodes that are below
that node in the tree. Like this, if we change, for example, only one file in the project, Git doesn't need to recalculate the whole tree or the whole repository. It only sees that something has changed in the tree and looks at which parts have actually changed. It sees: okay, this subtree here is unchanged, so I can just use the value that I have for it. This other subtree seems to have changed, so it recurses down into it and finally finds that this file has been changed, so it needs to put a new value for this file into its database. Apart from that, it can just use the old data that it has. This is what makes Git and other systems really fast. The blockchain that we have in Bitcoin is actually the same thing, only that in addition to the Merkle tree we have a so-called proof of work on top that verifies all the changes that you make to this structure.

For storing our code in a graph, we can basically take this idea directly and take it a step further. That is, we do not stop at the file level like Git does. We do not say: okay, this is a file, we generate a hash value for it, and then we store it in the tree. Instead, we use the graph data that we have for the file and store, in place of the file node here, the whole graph of the code in the tree, so to say.

If we do this and look again at our example from earlier, we start at the lowest node here and calculate the hash value for that node. Then we go up the tree and continue calculating hashes for all the other nodes until we reach the final node here. With that, we can store each of these nodes individually in our graph database, together with their relations, and have the whole code tree in the database.

Now imagine that we change something in our code.
For example, we change some parts of this assign statement: we change the variable name here, and we make some changes in the dictionary. Then we can apply the same technique that we saw earlier to store the changes in the graph database. Again, we calculate the hash of this whole tree, and we see that the hash of the function definition has changed, because this node here has been changed. So we need to store a new entry for that in the database. But we also see that, for example, this for statement is still the same, so we don't need to create a new node in our graph. Instead, we can just take an edge and link to the existing node. That's pretty efficient.

Now let's look at this on a large scale. Here I have an example for the Flask project, where we store multiple commits of the project in a graph database. What I plot on the axes is the total number of edges and vertices that you have in the project, so roughly the whole content of your code files, versus the actual number of edges and vertices that we store in the database. You can see that in the beginning, when we add the first commit, the number increases rapidly, because we are encountering a lot of new edges and nodes, so a lot of new code that we need to create in the database. But if we keep adding more to the database, we see that many of the things we add are already there. We do not need to create new nodes and edges for them but can use the existing ones. You can see here that when we are at 500,000 vertices in the graph, we only have to store about 8,000 actual vertices in the database.
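The hashing and deduplication idea can be sketched as follows. This is an illustration under stated assumptions, not the storage code behind the numbers above: a plain dictionary stands in for the graph database, SHA-256 is an arbitrary choice of hash, and `merkle_hash` and the two sample versions are hypothetical names made up for the example.

```python
import ast
import hashlib


def merkle_hash(node, store):
    """Hash an AST node from its own attributes plus its children's hashes.

    `store` maps hash -> (node type, attributes, child hashes); a node with
    the same content is stored only once, however often it occurs.
    """
    attrs, child_hashes = [], []
    for field, value in ast.iter_fields(node):
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, ast.AST):
                child_hashes.append((field, merkle_hash(item, store)))
            else:
                attrs.append((field, repr(item)))
    content = (type(node).__name__, attrs, child_hashes)
    digest = hashlib.sha256(repr(content).encode("utf-8")).hexdigest()
    store.setdefault(digest, content)  # no-op if the node is already stored
    return digest


version_1 = """
def copy_items(data):
    d = {}
    for k in data:
        d[k] = data[k]
    return d
"""
# Second "commit": only the dictionary in the assignment changes.
version_2 = version_1.replace("d = {}", "d = {'x': 1}")

store = {}
tree_v1, tree_v2 = ast.parse(version_1), ast.parse(version_2)

root_v1 = merkle_hash(tree_v1, store)
nodes_after_v1 = len(store)
root_v2 = merkle_hash(tree_v2, store)

# The for statement is the second statement of the function in both versions.
for_v1 = merkle_hash(tree_v1.body[0].body[1], store)
for_v2 = merkle_hash(tree_v2.body[0].body[1], store)

total_nodes = sum(1 for _ in ast.walk(tree_v1)) + sum(1 for _ in ast.walk(tree_v2))
print("root changed:", root_v1 != root_v2)
print("for statement shared:", for_v1 == for_v2)
print("nodes seen:", total_nodes, "nodes stored:", len(store))
```

Because `ast.iter_fields` does not include line and column positions, identical subtrees hash identically wherever they appear, which is what lets the second version reuse the unchanged for statement.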
So this makes this way of storing code in a graph really efficient and also doable on a large scale. And this is pretty cool, because using this technique we can really store everything in the graph database. IDEs, for example, also store some parts of your code in a kind of graph, to do things like auto-completion or code browsing, but they usually never store the whole code in a graph database. With this approach we can actually do that. We can also store not only the code of a single project but combine the code of multiple projects in the same graph database and, for example, see whether there is shared code between individual projects, or learn about other relations from the code that we have in there. And of course, as I said earlier, we can do this not only for a single state of a given project but for the whole commit history. So we can see not only the state at a fixed point in time but also the changes and differences between individual states of the project over time.

The end result looks like this, for example. This is again a graph of the Flask project, where we have stored various commits from the master branch. It has about 30,000 vertices and about twice as many edges. Here you can see, for example, the modules, then several classes and a lot of functions. If you look a bit closer at the graph, you can already see what I talked about earlier, namely the hashing of the individual nodes: as I said, every node that has the same properties is stored only once in the graph database. You can see here in the center a few nodes that have a lot of incoming edges. These are special node types of the Python syntax tree that tell the compiler or the interpreter, for example, that we want to load a variable or that we want to store something
to a variable. You can see that those nodes exist only once and have a lot of incoming edges.

All right. Now you're probably thinking: man, this guy is a bit bananas. What can I do with this graph? This seems rather useless. I mean, it's pretty, but what can I learn from it? So I want to use the next part of the talk to show you how you can actually work with this graph data.

When it comes to graph databases and working with graphs, there are two things that you always do. First, in order to get a starting point for your exploration, you need to select some edges or some nodes from the graph. You usually do this by using indexes that you have on the edges and vertices, for example to retrieve a list of all the function definitions or of other nodes that you're interested in. Second, as soon as you have this list of nodes, you can work with them by following the topology of the graph, for example by going through all the outgoing vertices of a given vertex and traversing the graph in an interactive way.

A first, rather easy example of this is to show all the symbol names that we have in a given project, sorted by their usage frequency. Again, here
I use the Flask project. I just retrieve some vertices from the graph using an index over the node types, for example all the function definitions. Then I group the resulting vertices by their name field and order them by frequency in descending order. Then I can see, for example, that I have a lot of names that contain the word index, and also some example functions or other things with the names foo and bar, as any good Python code should, of course.

The next example is to show all versions of a given function that you have in the code base. This could be useful, for example, if you want to see in which commit you introduced a specific version of a function, or how a specific function has changed over the whole version history of your project. You can do this by starting from the uppermost vertex in your graph, the root node if you want, and then following the path that is given by, for example, the module name of the function that you're interested in and the name of the function. So you just crawl your graph for that information and get all the different versions of your function in return.

Now you will probably say: okay, this is nice, but maybe you can also do this with some fancy regex stuff, so why do you need graphs for that? Navigation and exploration of the graph is only one aspect. What is more interesting for me is what you can do with graphs for code visualization. So let's have a look at an example here as well. One interesting thing, especially in large projects, is to get an overview of how complex your code is, because as programmers our everyday job is mostly to fight complexity and to manage the complexity of large software projects. As an example, we can analyze the cyclomatic complexity of a project. This concept of cyclomatic complexity is actually pretty old.
It's from the stone age of programming, so to say, from 1976, and it basically counts the number of different paths that your program can take. For example, if you have an if statement in your code, you increase the cyclomatic complexity by one, because your code can either go into the if statement and execute that branch or, if the condition does not match, continue on the other branch. It's kind of helpful to imagine the cyclomatic complexity as the number of unit tests that you need to cover a given piece of code. So a cyclomatic complexity of nine means that you need nine unit tests, or nine different assertions, to make sure that you test every branch of your code.

Using our graph, we can calculate the cyclomatic complexity pretty easily. Here I have a pseudo-algorithm for Python. This algorithm also starts at the root node of our project and just looks for nodes that have the FunctionDef node type, so a function definition in the code. If it finds one, it chooses that node as an anchor and initializes the cyclomatic complexity to one, because there is always one branch in each function, regardless of whether you have any if statements or not. Then it traverses the graph, following the outgoing edges of the function definition to all child nodes and checking for different types of nodes, for example for statements, while loops, if expressions, if statements, et cetera. Every time it encounters one of them, it increases the complexity counter of the given anchor node. Like that, we can traverse the whole tree and calculate the complexity of all the functions that we have in there. We can then use that information, for example by aggregating it by directories, files, and functions, and visualize it.

This is what we have done here. The visualization that we produce is called a cityscape.
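The complexity calculation described above can be sketched directly on Python's `ast` module. This is a simplified illustration, not the pseudo-algorithm from the slide: it counts only the four node types named in the talk (the "et cetera" is left out), it walks the in-memory syntax tree rather than a graph database, and the function names are hypothetical.

```python
import ast

# Node types that add a branch; only the ones named in the talk are counted.
BRANCH_NODES = (ast.If, ast.IfExp, ast.For, ast.While)


def cyclomatic_complexity(func_node):
    """Start at one and add one for every branching node below the anchor."""
    complexity = 1  # every function has at least one path
    for node in ast.walk(func_node):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
    return complexity


def complexity_by_function(source):
    """Map each function name in `source` to its complexity."""
    tree = ast.parse(source)
    return {
        node.name: cyclomatic_complexity(node)
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }


example = '''
def encode(data):
    out = {}
    for key, value in data.items():
        if isinstance(value, complex):
            out[key] = {"real": value.real, "imag": value.imag}
        elif isinstance(value, dict):
            out[key] = encode(value)
        else:
            out[key] = value
    return out

def trivial():
    return 1
'''

print(complexity_by_function(example))
```

An `elif` is parsed as an `If` nested in the `orelse` of the outer `If`, so it is counted as its own branch. Branches inside nested functions are also counted for the enclosing function in this sketch.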
It's again for the Flask project. What you're seeing are the different parts of the code: each city block here corresponds to either a directory, a module, or a class definition in the Flask code, and the individual buildings correspond to functions. The area of a building is given by the AST weight, which is the number of nodes that you have below a given node in the tree, so it's kind of like the number of lines of code. The height of each building is given by the total complexity of that node, for example of the function, whereas the color is the so-called specific complexity, that is, the complexity per AST weight. You could translate this more or less as how complex your code is divided by the number of lines you have. So if you have a very long function that has only very few branch statements, it's not very complex and shows up as green. If you have a very short function with a lot of branch statements and a lot of complexity, it shows up as red. You can see, for example, that these two functions here, send_file and url_for, have cyclomatic complexities of 22 and 14 and would maybe be in for a good refactoring. A nice thing about this way of looking at the code is that it allows you to get a very quick overview of the complexity and the structure of your code without actually going through your text files and the code itself.

Another example is visualizing the dependency graph of a project. Here we look again at the Flask code base, and we have visualized the relations and dependencies between individual modules. I have to say I cheated a bit when generating this, because up to now
I only talked about storing the code itself in a graph database, but we actually need some additional information to generate this kind of graph, for example information about the imports and the relations in the code, which we cannot just extract from the syntax tree. This is something that we can do in addition, but I will not go into it in this talk because it's a subject of its own.

In any case, what we see here is that the Flask project has a module called flask, where most of the things seem to happen. This module uses some other things, for example the flask._compat module, which contains compatibility code and other stuff, and also the flask.app module. What you can also see is that there are a lot of examples and tests that mainly import the flask module but do not depend on other things in the code base. So again, this gives you a very nice overview of your code. It gives you a feel for how the project is structured and how the different modules interact with each other, without actually looking into the code. If you wanted to get that information just by looking at the code files, you would probably need half an hour or an hour to go through all the files and see what they import and which other modules they are used by. Like this, you have everything at one glance. So this is another example where graphs can give us a much better understanding of our code than the code itself.

Another interesting area when analyzing code is, of course, finding patterns and problems. Normally, we edit code in our IDE or text editor, and we want to find, for example, certain names or certain things in our code base.
We use regular expressions for that. For example, here I have a string, hello world, and we use a regular expression to match either the German or the English version of it. Now, if we store our code in a graph, we can of course no longer use this, because we don't have any text information available. In this case, the hello world would be stored as a set of vertices and edges in our graph. So we need to think about a way to do the same thing on the graph that we do on text, that is, regular expression matching. It turns out that many people have thought about this, and there are various approaches. For example, there is XPath; there are proprietary query languages like Cypher, which is used with Neo4j; and there is Gremlin, which is more of a standard for how to perform queries on a graph and is used by various other graph databases. In our work, we have developed our own language, which is just a bit simpler and allows you to use a regex-like, structured syntax to perform pattern matching on the graph. If you compare these two examples: again, we look for a word that contains either hello or hallo, designated by the special or operator here, and which has an outgoing vertex that can be reached through the followed-by edge and that contains either the word world or Welt. Like this, we can translate the pattern matching from the text world to our graph world.

So what can we actually do with this? One thing is that we can build our own code checkers. You are probably all familiar with pylint or pyflakes, which are tools to check our code for certain style violations or problems. Using this pattern language, we can kind of write our own version of pylint.
So here I have an example of a very evil piece of code: a try/except statement that does not specify an exception type and also does not contain any error handling. Basically everything that goes wrong in this code will just be swallowed, and nobody will learn about it or see it. If you, as a team lead or as a programmer, decide that you do not want this to happen in your code, you can just write a regular expression for it that operates on the graph. The expression that matches this code is shown here. As you can see, it's pretty simple. It just looks for a node of the node type TryExcept that contains in its handlers section a node of type ExceptHandler which doesn't specify a type, so this is basically the empty exception handler, and whose body consists only of a pass statement. Using this expression, we can now go through our graph again and catch all instances of this pattern.

We may even want to make some exceptions, for example for this case, where we also have an empty exception type but basically just use it to do some error logging and then re-raise the exception. This might be okay, so we do not want to match it. We can just modify our pattern and say: I want to match everything except exception handlers that have a statement of type raise in their body. So we can change our pattern again and make sure that we do not catch these false positives. Compared to a normal linter like pylint, it's not only easy to write your own code checks, it's also very easy to adapt them and change them, for example to use them under new circumstances.

Okay, so this is a bonus chapter. I didn't know if I'd have time for it.
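As a rough illustration of the refined check described above, here is a sketch written against Python's standard `ast` module. It does not use the pattern language from the talk, whose syntax is not shown in the transcript; `find_swallowed_exceptions` and the sample code are hypothetical. In Python 3 the relevant node types are `Try` and `ExceptHandler`, where the talk mentions `TryExcept`.

```python
import ast


def find_swallowed_exceptions(source):
    """Return line numbers of handlers that name no exception type and whose
    body contains no raise statement (so logging and re-raising is allowed)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        has_type = node.type is not None
        reraises = any(isinstance(stmt, ast.Raise) for stmt in node.body)
        if not has_type and not reraises:
            hits.append(node.lineno)
    return sorted(hits)


code = '''
try:
    risky()
except:
    pass

try:
    risky()
except:
    log_error()
    raise

try:
    risky()
except ValueError:
    pass
'''

# Only the first handler is reported: the second one re-raises and the
# third one names an exception type.
print(find_swallowed_exceptions(code))
```

Adapting the check, as described in the talk, amounts to changing the two conditions inside the loop.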
Yes, I do have some time for this, so I want to talk about analyzing changes in your code base. Let's look at an example from the Django project. Software projects are of course not static: code changes often, and programmers need to understand what actually happens in the code base, for example when someone makes a new commit. If you look at GitHub, for example, and want to see what happened in a given commit, you get a line-by-line diff, which shows you all the lines that have been removed and all the lines that have been added in a given file. In this case, for example, we have changed a test function: we have removed one of the function parameters, added a decorator to the function, removed the class inside the function definition, changed this statement here, and added another statement.

With text this is really easy: you can just do a line-by-line diff and see what has changed. With graphs it's actually a bit more complicated. When we change our code, what happens in the graph is that certain attributes change. Here, for example, we have introduced a change in this variable e, and we have changed the name of our function. If we now want an algorithm that tells us what has changed, this is a pretty tricky problem. What our algorithm sees is only that the hash of this whole tree has changed and that some parts of the tree seem to have changed, but it isn't able to identify these changes with each other. Actually, this is an NP-complete problem.
So, pretty tricky. But we're not the only ones who have this problem, so I want to talk about a similar field, which is chemical similarity. In chemistry you also deal with a lot of complex molecules, for example this one here, which is called epigallocatechin gallate. It's contained, for example, in white or green tea, and today it's under investigation for its health effects, so people try to find out whether it's good or bad for your health. In order to do this, they need, for example, to identify chemical compounds that are similar to this one, or other things that they can use to synthesize it or to reason about it. For this, chemists have special databases for so-called chemical similarity, where they can look, for example, for all the molecules that, like this molecule here, contain various benzene groups or various types of atoms, and which give them as a result all the different candidates that they can then use for chemical screening and for trying to synthesize this or other molecules. This is actually a pretty complex problem. As I said, it's NP-complete, and there have been various approaches to make this matching possible.
For example, there are so-called Jaccard fingerprints, which are just bit fields that contain zeros and ones for different properties of the molecule. If you have a molecule that contains a benzene group, you have a one in place, I don't know, 132 of this bit array, and if you have a molecule that contains an OH group, you have a one in another place, and so on. So you can have, for example, 800 different identifiers for a given molecule and then just perform a bitwise AND operation to see how similar two different molecules are on this scale. Another thing that is used quite often there are Bloom filters, because with a Bloom filter you can efficiently test for membership in a given set. So you can ask: does this molecule contain benzene groups, or does this molecule contain a certain substructure? It turns out that we can use these things for comparing code and for solving our NP-complete problem as well. Again, this is the subject of a whole different talk and pretty complex, so I don't want to go into the details here. I just want to tell you about some of the applications that this has.

One interesting thing is, of course, the detection of duplicated code. We all know that copying and pasting code is pretty evil and should be avoided. It's actually pretty hard to find duplicates in code, because programmers change the variable names and change some small parts of the code, which makes it really hard to detect these things using a text-based approach. There are some interesting papers and some research on how to do this with graphs. For example, there is this paper from Bulychev et al., which is also the basis for their Clone Digger tool and which uses ASTs and some of the concepts that I talked about earlier to detect clones in the graph of the code. Another application is, of course, generating semantic diffs. An example of this is this
paper here by Fluri et al., where they try to extract a minimum number of edit operations from two states of a software project. What they do is basically this: they see the project in state A, they see the project in state B, and they try to figure out what kind of edit operations you would need to perform to get from one state to the other, so which classes you would need to add, which functions you would need to modify, et cetera. The last thing is, of course, the detection of plagiarism or copyrighted code, which might be less important in open source but is very important in corporate software development. There are also some interesting papers about that subject here.

So the last thing I want to show is how a semantic diff would look if we actually could compare the different states of our project using the graph. Instead of the line-by-line diff, we would not see, for example, that we have removed a line at the beginning and added another one; we would directly see that we have added a decorator to a function. Similarly, we would see that we have modified the function by adding an extra argument to it, that we have removed the class definition, removed this argument here from this function, and added this other statement. Maybe this doesn't seem like a big deal, but it changes a lot, because we go from "I added this line and removed that line" to actually saying "I modified this function parameter and added this decorator to the function." This would then also allow us to perform more analysis, for example trying to find the cause of some error that has been introduced. We could say: okay, my code is no longer working, so why is it not working?
Oh, we added a new parameter to the function, but we didn't add the corresponding argument to the function call. So we can reason about the state of our software in an automated or semi-automated way using these semantic diffs.

OK, so to summarize text versus graphs: both of them have their advantages. Text is really easy to write, it's easy to display, it has a universal format, and it can be shared and copied everywhere. It's not normalized, though, in the sense that it's hard to extract the information and relate the different pieces of information that are in it, and it's hard to analyze as well, as we saw. Graphs, on the other hand, are easy to analyze. They are normalized, so they contain the relationships between the different pieces of the graph, and they're also easy to transform. On the other hand, they're pretty hard to generate and they're not yet interoperable: there's no real standard for how to exchange graphs in different formats and how to work with them in a consistent way. So if you asked me what the future of programming will look like, it's probably that we still use text for small-scale manipulation and editing of our code, but that we use graphs for things such as large-scale analysis, visualization and also transformation of the code. So we should try to use the best of both worlds.

All right, so this was everything I wanted to tell you. I hope that you have some questions, and I will be happy to answer them if I can. Thanks.

I've got one question. Do you know some languages used more in industrial applications, like National Instruments LabVIEW?

Sorry, can you...?

Do you know some language like National Instruments LabVIEW?

Oh, yeah. As I said, I'm a physicist.
So I'm unfortunately very familiar with LabVIEW, which, for the others, is a graphical programming language where you can just draw lines between different things and program like this. I mean, it's an interesting example of graph-based programming, but it also shows very well the drawbacks of a graph: as I said, graphs are really good for reasoning about programs, but a pretty horrible way to interact with code, in my opinion. So, yeah, I don't know what your opinion is about that.

It's another kind of the same thing: writing code with blocks, no text.

Yeah. Any other questions?

Are you using graphs on a daily basis in your projects?

Yeah, so we actually use different database technologies to store our graph, if that was your question.

Do you use them on a daily basis in your work to analyze code?

Yes. You mean which databases we use, or whether we use them?

Do you use the graph technology, not the databases?

The graph technology, yes, we use that. Most of our products, for example for code analysis, are based on first converting the code to a graph, then annotating it with additional information, like types and relations between different parts of the code, and then performing these kinds of pattern-matching operations on it.

Is this sort of technology very good for spotting moved code, which is a thing that diffs do very badly? I can understand that if you've got a class and you move its definition, it would work quite well, but what if you've got a group of three or four lines and you move them within a function? Is it easy to pick that up?

So you can actually do this with semantic diffs.
I mean, it's easy to detect moved code if it's identical, so that it has the same hash, because you just need to compare the different hashes in your tree, and you can find that pretty easily. What is more complicated is detecting code that has been moved and then modified afterwards. This is the whole problem that I talked about: finding trees or graphs that are sufficiently similar that you can transform one into the other with a small set of edit operations. So yeah, you can do it; it's not possible in all cases, but it's definitely better than with text-based diffs.

With regard to pattern matching: you gave the example with an empty except, but could you also define multiple types of exceptions, or any exception?

So you mean whether you could, for example, specify a given exception type that you want to match?

Yeah, like: I don't want to see a pass with any type of exception, not just the empty except.

Oh, with any type of exception. I mean, then you would not even have a try/except block, because I think if you have the try/except, you will always have this node of type try/except in there. What you could do, of course, is modify the type here and say: OK, I want to match only certain things.

But can you provide a list?

Yes, you can also use lists. I didn't show that here, but you can basically use the same operators that you also have in string-based regular expressions, so you can say: OK, I want to match this, followed by something else, followed by something else again. So this is definitely possible.

Any more questions? A question in the front here.

Did you actually look into this Flask send_file function, whether it's really a problem, as your metrics say it is? Did you actually look into it and analyze it?
Yeah. I mean, the fact that the code is complex doesn't mean that it's problematic or faulty. It's basically just an indication of some part of your project that you should maybe refactor in order to make it easier for other people to understand, but it doesn't tell you anything about the correctness of the program in that case. So it's just a way to see where the complex code is in your software project, if that was the question.

We don't have time for more questions. Thank you for the talk.
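
The hash comparison mentioned in the answer about moved code can be sketched roughly as follows. This is only an illustration built on Python's standard `ast` and `hashlib` modules, not the tooling used by the speaker; the helper names `subtree_hash`, `statement_hashes` and `moved_statements` are hypothetical, and the matching is deliberately naive (it only pairs statements whose hash occurs exactly once in each version).

```python
import ast
import hashlib


def subtree_hash(node: ast.AST) -> str:
    """Hash a subtree by structure only; ast.dump omits line numbers by default."""
    return hashlib.sha1(ast.dump(node).encode("utf-8")).hexdigest()


def statement_hashes(source: str) -> dict[str, list[int]]:
    """Map each statement's structural hash to the line numbers where it occurs."""
    hashes: dict[str, list[int]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            hashes.setdefault(subtree_hash(node), []).append(node.lineno)
    return hashes


def moved_statements(old: str, new: str) -> list[tuple[int, int]]:
    """Return (old_line, new_line) pairs for identical statements found at different lines."""
    old_hashes = statement_hashes(old)
    new_hashes = statement_hashes(new)
    moved = []
    for digest, old_lines in old_hashes.items():
        new_lines = new_hashes.get(digest, [])
        # Assumption for this sketch: only unambiguous one-to-one matches are reported.
        if len(old_lines) == 1 and len(new_lines) == 1 and old_lines != new_lines:
            moved.append((old_lines[0], new_lines[0]))
    return sorted(moved)


# Two versions of the same function; the call to log() was moved to the top.
OLD = """def f(x):
    a = x + 1
    b = x * 2
    log(a)
    return a + b
"""

NEW = """def f(x):
    log(a)
    a = x + 1
    b = x * 2
    return a + b
"""

print(moved_statements(OLD, NEW))
```

Because the sketch compares absolute line numbers, the two assignments that were merely shifted down by the move are reported alongside the moved `log(a)` call; comparing positions relative to the parent node would be one way to narrow that down. Code that was moved and then modified gets a different hash and is not matched at all, which is the harder similarity problem described in the answer.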