Gábor is here because he's done a lot of cool stuff around graph querying, complex models and large models, and he will talk today about incremental graph querying. So let's welcome him, and we're ready for the recording.

So welcome everyone, my name is Gábor Szárnyas. I'm a PhD student at the Budapest University of Technology and Economics in Hungary, and I also worked as a visiting researcher in Canada at McGill University. My talk today will be about incremental graph queries with openCypher. So let's dissect this title and first talk about incremental graph queries.

Why would we want to use incremental graph queries? Well, let's imagine a simple railway network where there are railway segments and trains traveling along the network. Obviously this is a safety-critical system, so if there is some malfunction or a design error in the system, serious damage to property can occur or lives can be threatened. So it's absolutely important that we get this right and that we closely monitor this system 24 hours a day, seven days a week. We define constraints that must be kept while the system is running. For example, if two trains are too close to each other, we might issue a dangerous proximity warning and say that one of the trains must stop, as they might crash. Also, in this system there might be switches that route the trains, and an interesting constraint in railway networks is that if a switch is set to a diverging position, then a train cannot travel through it from the other direction, because it might damage the switch. This is called trailing the switch. This is also something that we want to avoid, because it damages valuable hardware.

Okay, so how do we model this as a graph? It's not that difficult: we just say that each segment will be a node in the graph, the trains will be nodes in the graph, and the switch will be a node in the graph, storing its position as diverging. Then we add some edges: the trains are on segments, the segments are connected pretty much as a linked list, and the switch has edges for the appropriate directions. For the proximity detection constraint, we say that if the trains are on adjacent segments, or there are only one or two segments between them, then we issue a warning. As a graph pattern, this looks like this: we have a train on a segment, and then there is a single segment or two segments between the segments that the trains are on. This is pretty much a simple Cypher query: you can just take the graph pattern and describe it in the usual Cypher syntax. If you run it in Neo4j, it generates a query plan like this, executes it and returns the results.

For trailing the switch we have another constraint. It looks like this: we have a train which is on a segment which comes straight from a switch, but the switch is set to a diverging position. This is something that we would like to avoid, so we would like to warn the users. Neo4j is also capable of running this pretty efficiently. However, in such a system we need to evaluate these queries continuously, so we need to run them over and over and over again, and this is basically the idea of incremental graph queries: we register a set of so-called standing queries in the system, and we continuously evaluate these queries upon each change in the graph. There is a quite simple algorithm for this called the Rete algorithm. It actually comes from 1974 and was originally designed for rule-based expert systems.
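For illustration, the two constraints could be written as standing Cypher queries along the following lines. This is only a sketch: the labels Train, Segment and Switch, the relationship types ON, NEXT and STRAIGHT, the number and position properties, and the exact hop bound are assumptions that approximate the model on the slides, not the exact schema from the talk.

```scala
// A sketch of the two railway constraints as Cypher strings, using an
// assumed schema (Train/Segment/Switch labels, ON/NEXT/STRAIGHT edges,
// a "position" property) that approximates the talk's railway model.
object RailwayConstraints {
  // Dangerous proximity: two trains whose segments are at most a few
  // NEXT hops apart (the exact bound is an assumption).
  val closeProximity: String =
    """MATCH (t1:Train)-[:ON]->(s1:Segment)-[:NEXT*1..3]->(s2:Segment)<-[:ON]-(t2:Train)
      |RETURN t1.number, t2.number""".stripMargin

  // Trailing the switch: a train on the segment that leaves a switch in
  // the straight direction, while the switch is set to diverging.
  val trailingSwitch: String =
    """MATCH (t:Train)-[:ON]->(seg:Segment)<-[:STRAIGHT]-(sw:Switch)
      |WHERE sw.position = 'diverging'
      |RETURN t.number, sw""".stripMargin
}
```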
The main idea of the Rete algorithm is that it indexes the graph and caches the interim results in a network. Rete actually means "net" in Latin, so it defines a propagation network for the query, and it stores the results so that it only has to maintain them upon changes. How does this look in practice? For the second constraint, trailing the switch, we have the graph pattern, and this is our Rete network. As you can see, it has your usual relational algebraic operators like joins, selections and projections. So if we take the graph, the first thing we do is index it for the appropriate edges: first we gather the on edges into the indexer on the left, then we gather the straight edges into the indexer on the right. Then we start to calculate the results for the graph pattern, so we join the nodes that are connected: we have the train on the segment, next to a switch. We do a filtering operation, which enforces that the switch is in a diverging position, and in the end we project, because we only need the number of the train and the switch that is next to the segment.

So what's the advantage of this approach? Well, if a train moves slightly, for example from segment E to segment D, then we only have to recalculate a small portion of the network. In this case we propagate the change through the on indexer, and then there are no more matches in the join operator, so the trailing-the-switch constraint is no longer violated.

To summarize: graph databases and relational databases usually use batch queries, which are request-driven. There is a client, the client issues a query, the database calculates the query results and returns them. The results are obtained on demand, so the client has to specifically ask the database for the query results. In contrast, incremental queries are event-driven: the client registers a set of queries in advance, and then for each change in the graph the results are maintained, in a continuous loop. The query results are always available to the client.

In the past there have been many implementations of incremental query engines. One of the most well known is the CLIPS rule-based expert system, which was developed at NASA. There is Drools' Rete-based engine, which is part of the JBoss stack and is developed by Red Hat, and our university developed the VIATRA query framework, which works on the Eclipse Modeling Framework and is quite widely used in the modeling community. We actually have a spin-off company called IncQuery Labs, which continuously develops VIATRA and provides support. So these are the academic and industrial tools, and we also have some very prototypical tools: INSTANS, which works on RDF and is developed by Aalto University; a Scala-based solution called i3QL, developed by the Technical University of Darmstadt; and IncQuery-D, which was my early PhD project at my university. What we don't have is a property graph-based implementation: as far as we know, there are no incremental query engines for property graphs yet. But this is about to change.
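To make the earlier walkthrough a bit more concrete, here is a minimal sketch of how a Rete join node can maintain its matches incrementally. This is my own illustration, not the engine's actual code: each input side is indexed by the join key, and a change on one side is joined only against the other side's index, so the work is proportional to the size of the change rather than the size of the graph.

```scala
import scala.collection.mutable

// A minimal sketch of an incremental Rete join node. Each parent's tuples
// are indexed by the join key; a change arriving on one side is joined
// only against the opposite index.
class JoinNode[K, L, R](leftKey: L => K, rightKey: R => K) {
  private val leftIndex  = mutable.Map.empty[K, List[L]].withDefaultValue(Nil)
  private val rightIndex = mutable.Map.empty[K, List[R]].withDefaultValue(Nil)

  // Inserting a left tuple yields the new matches it participates in.
  def insertLeft(l: L): Seq[(L, R)] = {
    val k = leftKey(l)
    leftIndex(k) = l :: leftIndex(k)
    rightIndex(k).map(r => (l, r))
  }

  // Deleting a left tuple yields the matches to retract downstream.
  def deleteLeft(l: L): Seq[(L, R)] = {
    val k = leftKey(l)
    leftIndex(k) = leftIndex(k).filterNot(_ == l)
    rightIndex(k).map(r => (l, r))
  }

  // insertRight / deleteRight are symmetric and omitted for brevity.
}
```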
So first we need a good query language and a good ecosystem that we can integrate with. What is openCypher? As you probably know, the Cypher query language aims to be the SQL of graph databases, and openCypher means that Neo Technology is trying to turn its query language into a standard. This comes with the obvious advantage that users will not be afraid of getting locked in with a specific vendor; instead they can rely on the grammar specification and the reference implementation. The openCypher project also provides a so-called Technology Compatibility Kit, which allows developers to test whether their own openCypher engines really comply with the rules of the language.

openCypher actually defines a smaller subset of the Cypher query language, but it still allows you to do many interesting things. You can do almost all of the pattern matching that you can do in ordinary Neo4j Cypher. You can do your filtering operations. There are collections like lists and maps, there are operators for data manipulation, and there are more advanced constructs like variable-length paths. Some features are excluded from the standard; these are called the legacy constructs. They are mostly implementation-specific things like LOAD CSV, indexes and constraints. Regular expressions, which are notoriously difficult to get right across different programming languages, are excluded. Some of the more sophisticated list functions, for example reduce, are removed because they are very powerful (you can almost program with reduce in your queries), and a lot of predicate functions are removed, along with functions that calculate the shortest path, case expressions and IDs. Interestingly, and somewhat coincidentally, some of the legacy constructs are also very difficult to handle incrementally. So in our prototype implementation we do not focus on the legacy constructs, but try to implement a meaningful set of the standard constructs of openCypher.

In order to implement our incremental query engine, we first mapped the openCypher query language to relational algebra. This is quite a long mapping; we had to define some custom relational algebraic operators, and we gathered our findings into a technical report, which we published online. It is built continuously, on every change in our system, so what you see in the technical report is always the latest version, as compiled by our parser and our query engine. We gathered a lot of queries from Neo4j use cases, for example tutorials, real-world use cases, blogs and GitHub, and we translated these queries to standard batch search plans defined in relational algebra, and then we translated those search plans to incremental search plans, also defined in relational algebra.

One of our findings is that some constructs, even in standard openCypher, are very difficult to get right incrementally. Lists in particular, because you have to treat lists with the appropriate granularity, and that's not easy when you have lists nested into each other which can be unwound and then collected back together. Also, standard Rete implementations use set semantics, while openCypher uses bag semantics, and you can additionally order your results and select the top n, or skip some results in the result set. And as I mentioned, shortest path and all shortest paths are quite difficult to get right incrementally; there are some papers on the topic, so in theory it can be done, but the algorithms are quite complex and do not always perform well in practice. So the challenging features in Cypher are those that are not traditionally handled by Rete implementations. We are looking into implementing the standard constructs and seeing what we can do about the legacy ones.
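As a flavor of what such a mapping produces, the trailing-the-switch pattern from earlier could be written roughly like this in relational algebra. This is textbook notation of my own choosing, not necessarily the exact operators defined in the technical report:

```latex
% Trailing the switch as a relational algebra expression (sketch):
% join the "on" edges with the "straight" edges on the shared segment,
% filter on the switch position, then project the requested attributes.
\pi_{t.\mathit{number},\; sw}\left(
  \sigma_{sw.\mathit{position} = \mathrm{diverging}}\left(
    \mathit{on}(t, \mathit{seg}) \bowtie \mathit{straight}(\mathit{sw}, \mathit{seg})
  \right)
\right)
```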
By now we have a tool which we call ingraph: an incremental, in-memory graph query engine. ingraph works in a client-server architecture: the client registers a set of queries, then continuously updates the graph through update queries, and it can either retrieve the query results or subscribe and get change notifications for the query result sets.

How does ingraph look internally? Well, it pretty much looks like any other graph database or graph query engine: it takes the query specification and builds a query syntax tree. Now, as I mentioned, the openCypher initiative provides a grammar for parsing the queries, but it is an ANTLR-based grammar, and while experimenting we found that the ANTLR grammar is a bit too low-level. So we kept looking and eventually found the Slizaa software architecture workbench, which implemented the openCypher language in Xtext. Xtext is an Eclipse project built on top of ANTLR, and it provides a really nice, high-level way of accessing the parse tree built from the query specification. So we use Slizaa's Xtext grammar, and then we build the relational algebra model with code implemented in Xtend, which is, as you have probably guessed, a technology built on top of Xtext. This allows us to implement the algebra builder very efficiently, and we build it on top of Eclipse models. The parser takes this input and builds a simple search plan, which is basically a relational algebraic expression.

Now what we have to do is somehow transform this into a Rete network, which allows incremental computation, and deploy it. So first we transform it and do some optimizations. Interestingly, for the transformation and the optimization we use VIATRA, which is itself an incremental graph query engine, so we use an incremental graph query engine to drive another incremental graph query engine. In the end we get a somewhat more complex network, the Rete network, and we deploy it with our query deployer, which uses asynchronous technologies and functional programming: Scala and the Akka toolkit.

Why do we use these technologies? Well, if we revisit the slide on the trailing-the-switch constraint, we find that the Rete nodes that perform the computation can be thought of as actors in the actor programming paradigm, and this model actually lends itself very well to actors, because you can think of the propagation of the tuples in the network as asynchronous messages. This works pretty well in practice. We did a lot of experiments with parallel and distributed implementations, and they scale quite well, but they are pretty difficult to set up. So currently ingraph is a single-node tool, but in theory it can work on multiple computers as well.
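As an illustration of the actor idea (a sketch under my own assumptions, not ingraph's actual classes), a selection node can be written as an Akka actor that receives immutable tuple changes and forwards the ones whose tuples pass its predicate:

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Immutable change messages: tuples propagate through the Rete network
// as positive (insert) and negative (delete) deltas.
final case class Insert(tuple: Vector[Any])
final case class Delete(tuple: Vector[Any])

// A selection node as an actor: it forwards only the changes whose tuple
// satisfies the predicate; everything else is dropped.
class SelectionNode(predicate: Vector[Any] => Boolean, next: ActorRef) extends Actor {
  def receive: Receive = {
    case m @ Insert(t) if predicate(t) => next ! m
    case m @ Delete(t) if predicate(t) => next ! m
    case _                             => // non-matching changes are dropped
  }
}

// Hypothetical wiring, e.g. filtering for switches in the diverging position:
// val system = ActorSystem("rete")
// val node = system.actorOf(Props(new SelectionNode(_.last == "diverging", downstream)))
```

Because the messages are immutable and each node's state is actor-local, the same network can in principle run in parallel or distributed across machines, which matches what we observed in our experiments.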
One challenge that was pretty difficult was getting properties right, because traditionally Rete uses flat tuples: you have primitives like IDs, strings, booleans, floats and so on. In contrast, here we have something like t.number, which is a primitive, but also sw, the switch, which is a whole node in the graph. So what we do is build the network first and then run a three-step inference. First, we infer the schemas of the operators: we start with the schemas of the indexers and build up the tree. Second, we traverse down the tree and gather the so-called extra attributes that will be required for the computation or for returning the query results: we gather the train numbers, then we see that for the filtering we need the position of the switch, so we note that we need the number and the switch position, and we propagate this to the bottom, noting that the number has to be propagated on one side and the position on the other. Finally, we traverse the tree from the bottom to the top and infer the schemas again. This comes with the advantage that we don't have to unnecessarily propagate whole nodes and whole relationships up the tree, and this saves a lot of memory in practice.

So let's revisit the railway model. As I said, in practice we want all of these queries to be evaluated 24 hours a day, seven days a week, continuously, and all queries must be evaluated. Fortunately this is quite easy with Rete, because we can use a single Rete network to check multiple queries. On the right we have the trailing-the-switch constraint, and on the left we have a new constraint which checks whether trains are in dangerously close proximity to each other. And as you may notice, we have a feature called node reuse, which means that if a node is used by multiple parts of the Rete network, it can be shared, and this saves a lot of memory if you have a lot of constraints. The figure also shows that the Rete network is basically a DAG, a directed acyclic graph, which allows it to reach a fixed point pretty quickly during evaluation.

How does this Rete network work in practice? We have another animated slide for that. Basically we have a few more indexers, then we evaluate the projections and calculate the join. We now find that trains 1 and 2 are too close together, so we issue a warning to the user: 1 and 2 are too close, 1 is on segment A and 2 is on segment D. Then some time passes and our train moves from segment D to segment E, and this shows the advantage of the Rete network, because now we only have to propagate the changes from the indexer. We change the ⟨2, D⟩ tuple to ⟨2, E⟩, and then we have no results for dangerous proximity, but we will have a result for trailing the switch, because this train is about to hit a switch that is set to a diverging position, which, as I said, is a maneuver that is illegal in railway systems.
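In terms of the earlier JoinNode sketch, a train move is just a deletion followed by an insertion on one input of the join, and only those two deltas are propagated. Again, this is a hypothetical illustration and the tuple shapes are assumptions:

```scala
// A train move expressed as two deltas on the hypothetical JoinNode from
// earlier: train-on-segment tuples joined with switch-straight-segment
// tuples on the segment id.
val onJoin = new JoinNode[String, (Int, String), (String, String)](
  leftKey  = _._2, // (trainNumber, segmentId), keyed by segment
  rightKey = _._2  // (switchId, segmentId), keyed by segment
)
val retracted = onJoin.deleteLeft((2, "D")) // matches involving <2, D> are retracted
val asserted  = onJoin.insertLeft((2, "E")) // new matches involving <2, E> appear
```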
I would like to highlight some use cases. We have a model validation framework called the Train Benchmark, which is also built around a railway network, but instead of doing runtime checks, we do the checks at design time. We have a static analysis framework for JavaScript source code that I will talk about in a minute. And I think incremental queries can be beneficial for all use cases where you have standing queries on a large and continuously changing graph. If you have a fraud graph and you would like to detect fraud rings, or users engaging in fraudulent behavior, then incremental graph queries can be very useful. Also, a huge IT infrastructure can be modeled as a graph; obviously it changes very quickly, and you would like to get warnings in case something goes wrong.

The first use case, the Train Benchmark railway validation, is basically a benchmark paper with the implementation available on GitHub. We implemented a set of scalable generators: we can generate graphs in the Eclipse Modeling Framework format, as property graphs, in the semantic web's RDF format and in SQL for relational databases. The benchmark defines a set of validation queries that must be enforced during the design of the railway network. We implemented this for more than a dozen tools, along with automated visualization and reporting. So if you have a graph query engine that you would like to try on large and continuously changing graphs, it's worth giving the Train Benchmark a go.

Another use case is the static analysis of JavaScript applications. Here is a very simple example: you have a division by zero in your source code, and you can detect it by building a syntax tree and doing pattern matching on that tree. So basically this Cypher query allows you to catch divisions by zero in your source code. Obviously this is a very simple example; in practice you have much more power when your whole repository is represented as a graph. You can do things like dead code detection across the whole repository, or type inference if you don't use TypeScript or another language that adds static typing to JavaScript, and you can then use these results for generating unit tests for your code.
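The query itself was only shown on a slide, so the following is a guess at its shape rather than the actual query. It assumes a hypothetical AST schema in which binary expressions are nodes with an op property and an edge to their right operand:

```scala
// A sketch of a division-by-zero detector over a JavaScript AST graph,
// assuming a hypothetical schema (BinaryExpression/Literal labels, an "op"
// property, a RIGHT edge to the divisor); the real query may differ.
val divisionByZero: String =
  """MATCH (e:BinaryExpression)-[:RIGHT]->(lit:Literal)
    |WHERE e.op = '/' AND lit.value = 0
    |RETURN e""".stripMargin
```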
Until now, we have formalized most of the standard openCypher language, gathered our findings into a technical report, and implemented a research prototype for the railway model validation queries. In this quarter we plan to use the TCK tests provided with openCypher; this will allow us to increase the coverage of our mapping, and it will probably also reveal a lot of bugs in our parser and our query engine. Once we are done with that, we will implement a more sophisticated optimizer using technologies from the Apache stack, probably Spark Catalyst or the Calcite system, and we plan to publish our benchmark results at a top-tier academic conference.

To summarize, I presented some open source projects: the ingraph project, which allows you to evaluate incremental graph queries in openCypher; the Train Benchmark, which allows you to benchmark such queries on realistic, large data sets; and a very interesting demonstrator called the MoDeS3 project, the model-based demonstrator for smart and safe systems. MoDeS3 revolves around the idea of a cyber-physical system: it uses a model railway and a lot of real-time computation to make sure that the model railway operates safely. These are all available on GitHub, with continuous builds, and licensed under the Eclipse Public License. Thank you for your attention.

So, we have five minutes for questions. Any questions from the room? Okay, you start first.

Thank you for the talk. Does the query plan change when the graph changes? Query plans often come from cost-based optimizers, and, for example during a load, the graph changes and its statistics change. Does the query plan adapt to this?

Yeah, so the question was whether we use a cost-based optimizer, whether we can change the query plan during execution, and whether we use statistics. This is a very good question, and the Rete algorithm has some very good properties in this respect, because once you have your query, you can determine which graph nodes and edges you will need for the computation, build only the lower level, the indexers, from your current model, and use those as an input for your cost-based optimizer. As I mentioned, we plan to use the Spark Catalyst framework. I started implementing a cost-based optimizer on my own, and it turned out to be extremely difficult, so we thought it might be a good idea to use Spark's or Calcite's optimizer, because these are obviously more sophisticated. Regarding dynamic reconfiguration: that is something you can theoretically do with the Rete algorithm, and we had some experiments with it. We have a system that does design space exploration, which is basically a search, but it allows you to dynamically reconfigure a system: the design space exploration looks for a better solution and also gives you the steps you need to take to transform your system to the other representation. This is very interesting from an academic perspective, but it is super difficult to get right in a real system, so for now it works only in theory. Okay, next one.

My question: this event propagation graph that I see on the screen right now looks to me exactly like an event propagation graph in a reactive program. Is that true?

Yes, there is a lot of truth to that, and it is very useful if you want to maintain incremental views. As I said at the beginning, the main difference between so-called batch and incremental queries is that one is more pull-based and the other is push-based, so I think this is a very nice fit for reactive applications.

And the nodes communicate with messages?

Yes, basically you have actors, and the actor model unfortunately allows you to store mutable state inside an actor, so that state is mutable, but between the actors all the messages are immutable. You do immutable message passing, and this allows you to run the calculation in a distributed way and in parallel.

You did some sort of pre-compiling of your model, assuming that the railway system is static. What would happen if this network of rails were not static? Would you still be able to pre-compile something, or would this simply not be possible?

So the question is whether we took advantage of the fact that the railway network is static and only the trains move around it. Basically, we didn't really take advantage of that; you can quite easily do the same with a pretty much fully dynamic model. But a lot of use cases look like the following: you have a big set of static data, say a smart city with its buildings and infrastructure, which don't really move anywhere, and then you have the dynamic things, like vehicles, weather and people moving around. Rete is, I think, quite a good fit for this, but it is not a requirement; it is just something that works really well with this algorithm.

I have a question. One thing that I imagine is quite difficult to do is aggregation. If your query contains aggregations, you have to be able to track back where the aggregated values originate from, and if they are lossy, that might be difficult.

Yeah, this is actually listed among the challenges in the openCypher mapping, somewhere between the lines, along with collections and lists. Basically, even if you want to maintain something as simple as a minimum or a maximum incrementally, you have to store all of the possible candidates, that is, all of the interim results, and then you can serve the result from those. You can store them in a heap or some other efficient data structure, but aggregations are quite difficult and are usually not supported by Rete engines.
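As a tiny illustration of that last point (my sketch, not how any particular Rete engine does it): maintaining an incremental minimum means keeping the full multiset of candidate values, because a deletion can remove the current minimum, and some other candidate then has to take its place.

```scala
import scala.collection.mutable

// Incremental minimum over a changing bag of values (a sketch): all
// candidates are kept in a sorted multiset so that deleting the current
// minimum can still be answered without rescanning the input.
class IncrementalMin {
  private val candidates = mutable.TreeMap.empty[Long, Int] // value -> multiplicity

  def insert(v: Long): Unit =
    candidates.update(v, candidates.getOrElse(v, 0) + 1)

  def delete(v: Long): Unit = candidates.get(v) match {
    case Some(1) => candidates.remove(v)        // last copy of this value
    case Some(n) => candidates.update(v, n - 1) // decrement multiplicity
    case None    => // deleting an unknown value; ignored in this sketch
  }

  // The current minimum, if any, is the first key of the sorted map.
  def min: Option[Long] = candidates.headOption.map(_._1)
}
```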
Okay, cool, so time's up. If you have more questions for Gábor, then please ask him later.