Okay, so welcome everyone, and thanks for the kind introduction, Michael. My name is Gábor Szárnyás, I'm a final-year PhD student from Budapest, Hungary, and in this short talk, I will talk about how to make better software with graphs.

So as Michael said, JavaScript has a somewhat bad name to it, but it's difficult to dispute that it's very popular. If you go on Stack Overflow, it's consistently ranked among the top languages with respect to the number of questions asked. And it's getting standardized: there is a standards body that releases a new version of the standard, called ECMAScript, each year. So essentially things are getting better; the JavaScript community is getting a better language. I'm not going to say that it's the best language or the most popular language, but it's widely used, from IoT devices to the browser. So it's important that we write good JavaScript code.

One of the techniques to guarantee good source code is called static analysis. The full name is static source code analysis, which means that we test software without compiling and executing it. We take the source code, do some analytics on the source code itself, and then try to check rules and find violations of these rules in the source code. This is complementary to traditional testing. In most continuous integration systems nowadays, you have your development environment, you push code to the source code repository, it then gets compiled by the CI server, and it gets tested by unit tests and integration tests. Static analysis is complementary to all that: it's a separate step that just queries the source code, does some analytics, and then, as a separate feedback loop, returns the results to the developer.

This is quite popular, so I'm sure most of you have seen some of the cloud services like Codacy, Code Climate, and so on. But the problem with these is that they form an offline feedback loop: you commit your code, and then you receive an email 15 minutes later saying that your code violates this and that rule. Another approach is to use command-line tools and IDE-integrated tools. If you have done some C programming, you know there is an old Unix tool called Lint. It is such a defining tool that it gave its name to the whole family of source code analysis tools called linters. If you're a Java developer, you're probably aware of FindBugs or PMD. And obviously there are tools for JavaScript: there is ESLint, Facebook's Flow engine, the Tern.js system, and so on.

Essentially these give pretty good coverage, but all of them have some drawbacks. We tried to do analytics over JavaScript in the past, and we found that there isn't a single system that allows users to define global rules, evaluates those rules efficiently, and can be extended with custom rules. These requirements are pretty difficult to satisfy at once, and obviously others have thought about this problem as well. Checking global rules is a computationally very expensive operation on a large source code repository, and it's actually so slow that it's sometimes difficult to integrate into the CI workflow. So there are a couple of workarounds. The first workaround is to not bother with global rules at all: write your code in a very modular, very separated way, and then use file-level static analysis. ESLint, for one, does that. Another workaround is to batch your CI analytics: you run your build and tests on each commit, but you only do a single analysis a day.
And you can also use custom algorithms. If you make your algorithm smart enough, it's going to be fast, but then it's going to be very difficult to extend with new rules.

So, in short, we made two important design decisions for the project. We wanted to create a static analysis tool for JavaScript that allows users to define custom analysis rules, be they global or local, and it should provide high performance, ideally close to real-time evaluation. So if the user is editing the code in the development environment, they should receive timely feedback on the changes they made.

One of the cornerstones of our approach is the architecture and the workflow. It's all built around incrementality, which means that we want to do the analytics in a way that incorporates the changes made to the code. Essentially, it first analyzes the source code as a whole, and then, for each change, it uses incremental processing. So if only a single file is changed in a 15,000-file repository, it only processes the changes to that file. Second, we wanted to use a declarative query language. Now, as you're in the graph dev room, you can probably guess which declarative graph query language that is; we will get back to that in a moment.

This is the high-level architecture of our system. It starts off with the version control system: all your code is committed to the version control system, it's then loaded into the workspace of the analyzer, where it gets transformed into a syntax tree and then into a semantic graph. We load this into the graph database, take a set of analysis rules that we want to check, perform continuous checks on the server, and give feedback to the client continuously.

So what are these steps? If you have ever played around with a compiler, they should be very familiar, because this is basically how most compilers work. They start off with the source code, which is a sequence of statements. For example, here is a very simple piece of source code which declares a variable foo equal to 1 divided by 0. A component called the tokenizer splits this into tokens, which are the shortest meaningful character sequences in the source code. So for var foo = 1 / 0, we get six tokens. The tokens then go through the parser, which builds the so-called syntax tree according to the grammar specification. For the source code line that we've seen, we get this syntax tree, and this is already quite close to what we want to use, but it is still missing some semantic information: it's missing the scopes, which will be added by the scope analyzer, and it's missing information on various accessibility attributes. Essentially, the abstract semantic graph (ASG) enriches the abstract syntax tree (AST) by adding scope information. So we take the tree and add some more edges. Once we have added these edges, it's no longer a tree, because it has cross-edges; all the scopes are defined, and the accessibility, kind, and other meta-information are added to the specific nodes.

So this is compiler construction in a nutshell, and you can see that even though we started off with a very simple example, six tokens, a single line of code, we get more than 20 nodes, and this can be a lot more: for a very sophisticated line of code, we can easily get 50 to 100 nodes. So these graphs are pretty large. However, once we have these graphs, we can do all sorts of pattern matching.
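To make the upcoming examples concrete, here is a minimal sketch of what a fragment of the semantic graph for var foo = 1 / 0 could look like, written as a Cypher CREATE statement, followed by the kind of validation rule we run on it. The labels and relationship types are hypothetical, loosely inspired by the Shift AST naming; the real graph carries many more nodes, edges, and properties:

```cypher
// Hypothetical ASG fragment for `var foo = 1 / 0`.
CREATE (decl:VariableDeclarator)-[:BINDING]->(:BindingIdentifier {name: 'foo'}),
       (decl)-[:INIT]->(div:BinaryExpression {operator: '/'}),
       (div)-[:LEFT]->(:LiteralNumericExpression {value: 1}),
       (div)-[:RIGHT]->(:LiteralNumericExpression {value: 0});

// On such a graph, a division-by-zero rule becomes a pattern match: filter
// for division expressions whose right operand is the literal 0, and
// project out the affected binding.
MATCH (decl:VariableDeclarator)-[:BINDING]->(b:BindingIdentifier),
      (decl)-[:INIT]->(expr:BinaryExpression)-[:RIGHT]->(zero:LiteralNumericExpression)
WHERE expr.operator = '/' AND zero.value = 0
RETURN b.name AS divisionByZero;
```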
I said that we are going to use a declarative graph pattern language, and that language is Cypher. So if you have a graph like this and you know a bit of Cypher, you can actually specify validation rules. For example, you don't want your code to do division by zero, so you create a rule, like the one sketched above, which matches the binding identifiers that occur in a binary expression, then does a filtering where the expression is a division and the right operand is a zero, and then does a projection operation to return the binding. This is very useful for the developer, because the developer can fix the problem instantly. It's a well-known truth that the sooner developers get feedback on the errors they made, the cheaper those errors are to fix. So ideally, we should give developers timely feedback on the mistakes they made in the code.

Workflow-wise, it starts with the developer's IDE and the version control system. As a first step, code is loaded from the version control system and transformed to tokens, an AST, and an ASG, step by step. Then it's transformed by a set of Cypher queries and Java code, and it's loaded into the graph database. Once we have this, we trace the root causes of the errors back to the source code and display the errors in the developer's IDE.

Once we have a workflow like this running, it actually allows us to do very cool things. One of my favorites is type inference, because as you know, JavaScript is a dynamic language, and it's very easy to write code that throws runtime errors because of type violations. Obviously, there are some workarounds for this: you can use TypeScript or other typed flavors of JavaScript, but there is a lot of legacy code written in plain JavaScript, and type inference is key to using it in a way that will not throw errors while running the code.

Another use case is global analytics. Because we have this graph, we can do a lot of cool reachability-style queries. We can do dead code detection. We can detect async/await problems, where you start an asynchronous call and it's dangling somewhere in the code, but you never await it. And we can detect potential divisions by zero by propagating these properties through the code, so we can check whether a value can be zero at the point in time when it's evaluated.

Now some tech details that I think will be very interesting for this audience. One of the key issues in implementing all this is that imports and exports are just crazy in ECMAScript: you have a dozen ways to import and even more to export. So we have drawn this nice compatibility matrix, and just to give you an idea of how long it takes to implement, a single black dot, which means "compatible", needs something like 15 lines of quite complex Cypher code to handle. So it's a lot of work to cover all these cases.

Obviously, once we have this, we have to implement the algorithms. As I said, we have some propagation algorithms where we want to propagate a property along the graph, be that type information or the fact that a value can be zero or not. This uses something called run-to-completion scheduling: we give a set of transformation rules to the system and then ask the system to execute them until there are no more changes to apply. This is actually quite difficult to do in plain Cypher, so we use a mix of Java code and Cypher code, where the Java code drives the propagation for as long as it makes changes.
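As an illustration, a single propagation step could look roughly like the query below. The label-free pattern, the ASSIGNED_FROM relationship, and the mayBeZero property are all hypothetical; the surrounding Java code simply re-runs the statement until it reports zero changes, which is the run-to-completion fixpoint:

```cypher
// Hypothetical propagation step for "this value may be zero": mark every
// node that is assigned from an already-marked source. The driver-side
// Java loop repeats this query until `changed` comes back as 0.
MATCH (src {mayBeZero: true})<-[:ASSIGNED_FROM]-(dst)
WHERE dst.mayBeZero IS NULL
SET dst.mayBeZero = true
RETURN count(dst) AS changed
```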
Another thing that we struggled with was efficient initialization of the database state. In the first implementation, the initial build of the graph happened with Cypher statements: we built the graph pretty much node by node with separate Cypher statements, and that was obviously quite slow. So we started to think about this a bit and used CSVs to generate the graph. As a first step, we generated just two CSV files, one for the nodes and one for the relationships, and then used the Neo4j import tool to load them into the database. This is not a very sophisticated approach (you could use multiple CSVs, binary formats, or other things), but it already gave us a 10x speedup. It was about one day of work, and the initial load went down from an hour to a couple of minutes.

We also stumbled upon regular path queries quite regularly, because there are a lot of cases where we need a transitive closure over certain combinations of relationship types. For example, you can have a function that's assigned to a variable, which is in another function, which is assigned to another variable, and so on, transitively. Essentially, we want a transitive-closure-style operation over that combination of relationship types. The problem is that this is not supported by Cypher yet, so we created a workaround: start a transaction, add proxy relationships that span those relationship combinations, compute the transitive closure over the proxy relationships, and then roll the transaction back, which essentially deletes those edges from the graph.
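A minimal sketch of this workaround, with made-up labels and relationship types. The point is that Cypher's variable-length operator can only repeat a single relationship pattern, so we first collapse the combination of relationships into one proxy type:

```cypher
// Within one transaction: materialize proxy edges over the relationship
// combination we want to traverse transitively...
MATCH (f:Function)-[:ASSIGNED_TO]->(:Variable)-[:CONTAINED_IN]->(g:Function)
CREATE (f)-[:PROXY]->(g);

// ...so that a plain variable-length traversal over the single proxy type
// expresses the transitive closure.
MATCH (f:Function)-[:PROXY*]->(g:Function)
RETURN f, g;

// Finally, roll the transaction back, so the proxy edges never persist.
```

Because everything happens inside a transaction that is rolled back at the end, the proxy edges are never visible to other queries.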
This is, I think, a proper workaround, but it's not the nicest way to express such queries, and obviously the Cypher team is very well aware of this. So for the next openCypher, there is a proposal for path patterns, which allows users to write an expression with several relationship types next to each other and then apply a transitive-closure-style operation to the combined pattern.

Okay. So I said that incrementality is very important in this work, and actually, this was my motivation to start working on this topic. As I said, we built our system around Cypher queries. As you probably know, there is now an initiative called openCypher, which aims to deliver a standard specification of the Cypher query language. It was started about two years ago, it has been adopted by the industrial vendors whose logos are listed here, and there are also a couple of research prototypes. Most notably, there is Graphflow, which is developed at the University of Waterloo, and there is ingraph, which is my PhD project. Interestingly enough, both of these target the same goal: the incremental processing of Cypher queries. So you have a set of Cypher queries, and you can evaluate them incrementally and continuously in your system. If you're interested in the details, last year I was here in the same room giving a talk about ingraph, and here are a couple of slides from it. The way ingraph works is to first compile the openCypher queries to relational algebra, then transform that relational algebra into a representation that is incrementally maintainable, and then use an incremental relational engine to calculate the results of those queries.

In the last year, we extended ingraph with a lot of new features. It now covers a substantial fragment of the openCypher language, including subqueries, functions, aggregations, and some of the data manipulation operations, like create and delete. Some features are still missing and on the roadmap, like merge, remove, and more sophisticated expressions such as list comprehensions, but it's coming together nicely, and the current state of ingraph almost allows us to evaluate the most important JavaScript static analysis queries. So it's possible in theory, and we have two papers on this: one is about the compilation of Cypher queries to relational algebra, the other is about the incremental maintenance of those relational algebraic expressions. So this is, I think, a very cool use case for ingraph.

As Michael said, his pet peeve is software analytics, and I think this is an area that's very important. We, as developers, should strive to make better software, and others have realized the need for this and the usefulness of graphs for understanding and analyzing code. There is a tool called jQAssistant. It's basically a code comprehension tool that scans your software and turns it into a graph; then you can use arbitrary Cypher queries to understand the code, and you can register a set of validation queries that you want to check on each build. There is a blog post on this on the Neo4j blog. There is also Slizaa, which is closely tied to jQAssistant: it's an interactive front-end on top of jQAssistant, and the idea is the same. You take a bunch of Java files and so on, throw them at the system, it scans them and loads them into the database, and then you can use an interactive Cypher editor to visualize and explore your system. It actually has an Eclipse-based IDE, and as part of that IDE, it has a grammar that provides a Cypher editor. Funnily enough, as part of the ingraph project, we managed to extend that grammar with some new features: we added some features that were introduced in the openCypher language recently, and we added a scope analyzer. You might think that you're not using Eclipse, so this is not very relevant, but Xtext is actually quite independent from Eclipse, so you can run it in a web UI. This gives you an editor which allows you to refactor Cypher queries correctly: if you do a rename refactoring and change the name of a variable, the change is traced through the whole query.

Okay, to wrap up. If you found all this interesting, we have two theses on this, from 2016 and 2017. They are very well-written and nicely illustrated works, and I think they are quite pleasant reads; all of these links are clickable if you're interested. As a conclusion, I think it's fair to say that interesting analysis rules, at least some of them, require a global view of the code, so it's not enough to just scan a file and do a standard linter-style analysis. Instead, you should use a graph representation for your source code, and property graph databases are definitely a good fit for this: they are very expressive, and the Cypher query language is quite easy to use and easy to understand. In particular, these are very good use cases for incremental graph queries: if you make sure that your system incorporates incrementality on multiple levels, you can end up with a system that's fast enough for real-time answers. These are the related resources that you can find on GitHub. Bear in mind that these are all academic prototypes, so they work some of the time, on some use cases.
They are more like proof-of-concept software. I would like to thank the whole team that worked on this: the students, my colleagues, and Ádám. Actually, Ádám Lippai is an old friend of mine, and he's giving a talk in the source code analysis dev room tomorrow. So if you came here just for the JavaScript part and were left unsatisfied, you can go there tomorrow at 4:20 and attend his talk. Thank you for your attention.

Okay, questions? We can also talk here afterwards, absolutely.

So we have 12 more spaces for an after-FOSDEM graph dinner. If some of you want to join us there, or if you want to talk to Gábor, feel free to come along after our next talk. But any other questions for Gábor?

Okay, so let me repeat the question. The gentleman is from Firefox, I understand, and the question is which code repositories we have tried our tool on; you're probably concerned about scalability. Obviously, we went on GitHub and grabbed a couple of source code repositories. Most notably, there is Tresorit, which is a cloud storage system; it's like Dropbox, but more focused on encryption and security. Their front-end library is approximately 70,000 lines of JavaScript code. That was the one that took us about an hour to load, and after we optimized the loading, it went down to five or six minutes. So that was the largest one we used. We struggled a lot to get a parser that works well, because we first tried the Babel parser, and the problem with that was that it doesn't really provide scope information; it's just the AST, and that is very difficult to work with. But this logo here, the S in the figure, is the logo of Shape Security, and they have a very well-written library called shift-java. It's an AST builder, but it also provides a lot of scoping information. It's a very nicely written piece of software, actually; it's beautiful Java code. The problem is that it's not really maintained: it's well-written, but it seems abandoned, so we have to maintain it now, and we actually started to add the ECMAScript 2017 features, like async and await, to the parser, which is a work in progress.

Okay. Any other questions? Yes?

So the question was what I quantify as real-time, in terms of metrics, since I said "real-time" a lot in this presentation. Well, essentially, the results should appear quickly enough for developers while they are working on the same file: you're writing your file, you make some changes, you press Ctrl+S or Cmd+S, and the warning should pop up next to the other errors. So ideally, it should be sub-second.

The follow-up question was that different sizes of code repositories will mean different execution times for the check, and whether there is an average time it should take. Well, for this, we actually plan to use the ingraph engine more extensively. The whole idea of ingraph is to build a huge cache for your queries, and once you have that cache, you can do, I won't say constant-time, but very quick query evaluations, because you have all the interim results of your queries cached. So if you just introduce a small change, which is what you usually do while developing, evaluation should be very quick, and it should stay within the one-second range.
The other thing is that these graph queries for the analysis are mostly local queries, so they don't touch the whole code base or the whole graph, but only something in the neighborhood of this variable or in the neighborhood of this function, such as all the callers of this function. They don't touch everything that's in the code base, and that's why the approach is less dependent on code size: most of these queries are local.

Yeah, and I'm not an experienced JavaScript developer, but I believe there is also a need for queries that are global, like reachability checks: is this piece of code still reachable, is this still correct from a typing perspective, and so on. So you have to think of global queries as well.

Did we ever do structure detection, so that we could run something like a graph algorithm on top of the data to identify things that would actually belong in modules but currently are not modules? So the question was whether we tried to identify such patterns in the graph, and no, we didn't; that's more like motif detection. But it's a very interesting mixture of network science and source code analytics, so that's a good suggestion. Thanks.

Okay, one last question, I think. Yes?

Yeah, so Facebook Flow actually uses custom algorithms to make the evaluation very efficient. From what I understand, it does a few things and does them very well and very quickly. But it's very difficult to extend Flow with your own custom rules. So if you say that your company policy is that you cannot do this and that in the code, and you want to enforce that with your static analysis tool, then Flow is not a very good fit. That's, I think, the key difference.

Yeah, and something that we do in jQAssistant as well is enriching the graph. You have the original source graph, and then, like Gábor said, with scoping information for instance, you can take this original source graph and enrich it with new technical concepts, or with actual business concepts, depending on your domain, and then formulate the static analysis queries on top of these higher-level concepts. You don't have to talk about variables and expressions and functions; you can say, for instance, that all tests in my code base should have this and that attribute, or that component code should follow this and that style. So you can move a lot of these checks to a higher level of abstraction when you write the queries on top of these concepts, and jQAssistant does that for Java code already. So that's cool.
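To give a flavor of what such an enrichment can look like, here is a hedged sketch in the style of a jQAssistant concept plus a rule on top of it. The labels and relationship types loosely follow jQAssistant's Java scanner model as I recall it (a :Type node with name and fqn properties, DECLARES and ANNOTATED_BY relationships), but treat the details as illustrative rather than the exact schema:

```cypher
// Concept: tag every class whose name ends in 'Test' with a higher-level
// :Test label, so later rules can talk about "tests" rather than raw types.
MATCH (t:Type:Class)
WHERE t.name ENDS WITH 'Test'
SET t:Test;

// Rule on the enriched concept: every test class should declare at least
// one method annotated with @Test (schema details are illustrative).
MATCH (t:Test)
WHERE NOT (t)-[:DECLARES]->(:Method)-[:ANNOTATED_BY]->()-[:OF_TYPE]->(:Type {name: 'Test'})
RETURN t.fqn AS testClassWithoutTestMethod;
```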
If you want to contribute to any of these source code analysis frameworks with graphs, please reach out to Gábor or me. Yeah, we could use some help, so absolutely. Okay, thank you again. Thank you, Lars.

In our last talk, we'll hear about the use of graphs in medical records and things like that. And the dinner starts at 6, so we'll leave together from here.

Hey, Gábor. Hey. Great job. Do you have, like on a stick or in the presentation, the list of parsers that exist, written in JavaScript, for parsing the AST and everything like that? I don't have a USB stick, regrettably. But can you just email me the links? It takes time to go through everything and collect the links. Yeah, I can zip it up and send it. Can I give you a business card? Yeah, here it is. The slides are already online, I think; I did put them online. We are supposed to put them online half an hour before the talk, and I did that, so I hope it works.