So for the talk, we thought we'd start with some motivation, then I give a short overview of the project for those who don't know what I'm talking about, and then we talk about how Cypher is realized and implemented in the system. So let's start with some motivation. When we talk about graphs, the typical example is social networks, and this is also the example that we chose. In this case, it's a social network of orcs. So each vertex here represents an orc, and the relationship between them is a "hates" edge, so they hate each other. And a smart analyst would ask something like: who are the closest enemies of each other? That's a typical question in this scenario. And if you want to express such a question, you can do it, for example, with Neo4j's Cypher, and then it looks like this: we have a MATCH part, there is an orc which hates another orc. Okay, now let's have a look at how this would look in Gelly, which is the graph processing library on top of Apache Flink. So this is the same query, and as you can see, it's quite a lot of code. What's happening here? Here we have something like a filter on the "hates" label, like the predicate we had before, and so on. So it takes a lot of code to express the same thing that we just expressed in Cypher. And another problem is that typically these graphs are not homogeneous like this one; we have different kinds of vertices and different kinds of edges. For example, we have clans and we have the hobbits, and we have not only the "hates" relationship, but also clan membership and the "knows" relationship between the hobbits, for example. And at the end of this, we ask something like: which two orcs from different clans hate each other, where one of them knows Frodo over one to ten hops?
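Spelled out, the extended question just described might look roughly like the following Cypher pattern. This is a reconstruction from the talk, not the exact query from the slides; the label, type, and property names (Orc, Clan, Hobbit, hates, memberOf, knows, name) are assumptions. It is wrapped in a plain Java string here just to keep it concrete:

```java
// A reconstruction of the motivating query as a Cypher string.
// Label, relationship type, and property names are assumptions,
// not taken verbatim from the talk's slides.
public class MotivatingQuery {
    public static String cypher() {
        return "MATCH (o1:Orc)-[:hates]->(o2:Orc), "
             + "(o1)-[:memberOf]->(c1:Clan), (o2)-[:memberOf]->(c2:Clan), "
             + "(o1)-[:knows*1..10]->(f:Hobbit {name: 'Frodo'}) "
             + "WHERE o1 <> o2 AND c1 <> c2";
    }

    public static void main(String[] args) {
        System.out.println(cypher());
    }
}
```

The `*1..10` part is the variable-length piece that comes up again later in the talk.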
So let's look at how we would say this in Cypher. It's still quite compact: we match the two orcs and their clans, we have "knows" to Frodo over one to ten hops, and we have some predicates here. Okay, and how would you express this on a graph processing framework? You can try it, but no. So that's our motivation: we want to make graph queries, especially pattern matching, usable on these kinds of frameworks, that is, on distributed data flow frameworks like Apache Flink or, for example, Apache Spark. To be honest, of course, each of those systems has a specific use case and its strong points. So we have two dimensions here; reality is more complex than this, but for this talk it's enough. We have ease of use, which is about how easy it is to express your problem in the system, and we have data volume, which is about scalability, using more machines to compute. On one end of the spectrum, we have graph databases, like for example Neo4j. You have a declarative language, it's easy to express the problem, you have visualization, a lot of tooling, a community, and so on. On the other end of the spectrum, you have graph processing systems. A well-known one is Apache Giraph, which is really dedicated to computing graph algorithms on a distributed, shared-nothing cluster, but only for this, nothing else. Then we have a bit more abstraction here: the graph data flow systems. These are also graph processing systems, but on top of a data flow system like Apache Spark or Apache Flink, so GraphX on Spark, Gelly on Flink. And now you see there's a free area in the upper right corner, and this is where we position our research project, which is called Gradoop, where we want to combine the expressiveness of graph databases (not quite as much as graph databases, a little bit less) with the scalability of the data flow systems. So that's the goal of the project.
And I will give you a short overview of the project before we dive into the Cypher implementation. We build on a shared-nothing cluster, like I already said. We use Hadoop HDFS as our data source and data sink; that's where we store our graphs. We use Apache Flink as the distributed operator execution layer; this is where we actually implement our graph operators. Gradoop itself is just a layer on top of Apache Flink, which uses the batch API of Flink, or the stream API, but I will maybe talk about this later. What Gradoop provides to you is an abstraction of property graphs on top of this data flow system. This is the data model that we implement; I will talk about it on the next slide. And we provide a set of graph data flow operators. One of them is, for example, pattern matching, which we will talk about in detail today. For those who don't know what Flink is: on the website they call it an open source streaming dataflow engine, and it provides data distribution, communication, and fault tolerance. So all the hard parts of parallel and distributed programming are handled by the system. It does this really nicely, especially for the streaming use case, but it also has a batch component, which is what we are currently building on. So what are the basic abstractions you have in Apache Flink? One is a DataSet. A DataSet is just a distributed collection of data objects, like POJOs (so normal Java objects) or tuples, things like that. And then you have transformations. Transformations are just operations on data sets that produce another data set, like, for example, map or reduce. So we have a data set here, a transformation produces a new data set, and a data flow program is just a composition of those transformations. And this is how you program in these kinds of systems.
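The DataSet-and-transformations idea can be mimicked on a single machine with plain Java streams. This is a sketch of the programming model only, not of Flink's actual API or of how it distributes work:

```java
import java.util.List;
import java.util.stream.Collectors;

// Single-machine sketch of the dataflow programming model: a "data set"
// is a collection, a transformation turns one collection into another,
// and a program is a composition of transformations. Flink's DataSet
// API has the same shape, but distributes the data and the work.
public class DataflowSketch {
    public static List<Integer> program(List<Integer> input) {
        return input.stream()
                    .map(x -> x * 2)      // transformation 1: map
                    .filter(x -> x > 4)   // transformation 2: filter
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(program(List.of(1, 2, 3, 4))); // [6, 8]
    }
}
```

The important point is that the program is just the composed pipeline; the framework decides where and how each transformation actually runs.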
So you have to translate your problem into a data flow program using the abstractions that the data flow framework provides. What Flink then does for you is the parallelization: it scales out the computation and the data to multiple machines, computes, and gives you a result. So what are those abstractions? I guess it's a bit small here. They are categorized into Flink DataSet transformations that are unary, which means one data set as input, and binary, with two data sets as input. You have the usual suspects like map and reduce, the SQL-like things like filter, group-by and so on, and the binary ones like join, union, and co-group, for example. And there are special operators for iterations, namely bulk iteration and delta iteration, which allow you to express iterative algorithms; most graph algorithms and machine learning algorithms are iterative, so we use these as well. Gradoop, on the other hand, like I already said, is an abstraction on top of Apache Flink. The DataSet API is still there underneath, but Gradoop already provides you a property graph abstraction on top of Flink, plus several operators. The basic data model that we have is a property graph: we have vertices and directed edges. And we have this concept of so-called logical graphs: we have not only one graph, but you can have multiple graphs inside your database, which can also overlap. The concept here is that these graphs are also first-class citizens, like vertices and edges, and they can be input and output of operators, of transformations. This is why they also have identifiers, so they are uniquely identified in the system. You have type labels. Maybe you can't read it, but this is an orc, this is a clan, this is an area they live in, it's Mordor, and it has attributes. And here we have the hobbits. So we have attributes and labels everywhere, and you can just load this into the system.
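A minimal sketch of what such a vertex carries in the extended property graph model: an identifier, a type label, arbitrary properties, and the set of logical graphs it belongs to. This is illustrative plain Java with assumed field names, not Gradoop's actual data model classes:

```java
import java.util.Map;
import java.util.Set;

// Illustrative EPGM vertex: id, type label, properties, and membership
// in possibly several (overlapping) logical graphs. Field names are
// assumptions for this sketch, not Gradoop's real implementation.
public class EpgmVertexSketch {
    public final long id;
    public final String label;
    public final Map<String, Object> properties;
    public final Set<Long> graphIds; // the logical graphs this vertex is in

    public EpgmVertexSketch(long id, String label,
                            Map<String, Object> properties,
                            Set<Long> graphIds) {
        this.id = id;
        this.label = label;
        this.properties = properties;
        this.graphIds = graphIds;
    }

    public static void main(String[] args) {
        // the same vertex can belong to two logical graphs at once
        EpgmVertexSketch frodo = new EpgmVertexSketch(
            42L, "Hobbit", Map.of("name", "Frodo"), Set.of(0L, 1L));
        System.out.println(frodo.label + " " + frodo.properties.get("name")
                           + " is in " + frodo.graphIds.size() + " graphs");
    }
}
```

The graph-membership set is what distinguishes this from a plain property graph element as in Neo4j.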
The abstraction is already there; you don't have to care about how this is mapped to the underlying DataSet API. So this is what Gradoop provides you from the data model point of view. It's called the Extended Property Graph Model because of these logical graphs that you have in addition to the regular property graph model, as in Neo4j, for example. The transformations that we provide are not based on data sets, but on logical graphs as input, so a single graph or a collection of graphs as input. We also have unary and binary operators. For example, what we have here is pattern matching, which we will focus on today, and other things like grouping, which is a graph OLAP operator, or subgraph, for example. Binary operations, for example, combine two graphs into a new graph, or compute the overlap between two graphs, or the exclusion. There are also graph collection operators, which are kind of relational: you have a large collection of graphs and you want to, for example, filter them by a predicate, like selection in relational algebra, or do pattern matching on each of these graphs. And there are the set-theoretical operators like union, intersection, and difference of collections, and so on. I want to give you just two examples of how these graph data flows look, starting with the subgraph operator. The input here is a logical graph. We have different kinds of labels; the colors are the labels on these vertices and edges. What you do as a programmer is just say, okay, I have my logical graph, I load it from HDFS (it's just one method call here, but there's a bit more happening), and then you call something like myGraph.subgraph and give it two predicates. These are two lambdas, which are just filter functions for vertices and edges. What we say here is: a vertex is in the output graph if it has a green label, and an edge is in the output graph if it has an orange label.
And of course, both the source and the target vertex of an edge have to fulfill the vertex predicate. So in this example, the result would look like this: we have only those vertices that fulfill the predicate and the edges between them that fulfill the edge predicate. The output is, again, a logical graph, which can again be input for another operator. So this is the data flow on the graph level; that's what Gradoop gives you. And what it also gives you is pattern matching. We have, again, a logical graph as input, and then you define a pattern. For example, we are looking for subgraphs that start from a green vertex with an orange edge to an orange vertex. You just take the graph, call match, and define a Cypher pattern; it's just green, orange, orange. The result in that case is a graph collection which contains all the subgraphs that match this pattern, and these are of course subgraphs of the input graph. So this is, from a high-level point of view, how Gradoop works. And now we will really focus on pattern matching, on how we do this. It's just an example operator (we have multiple operators), but it is a very interesting one. I will give you an overview, and then Max will give you more details about the implementation. The steps are really similar to any query processing system. You have the query; this is the query from the beginning. We parse the query first; we've written our own ANTLR grammar. We could switch to openCypher; the problem here is that openCypher doesn't support the extended property graph model, but that's an implementation detail. We then translate this into an object representation of the query. We translate the predicates into conjunctive normal form, because that makes them easier to evaluate. Then we use statistics, for example, how many vertices with a red label are in the search graph, how many blue edges are in the search graph; there are more statistics.
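The subgraph operator just described can be sketched in a few lines of single-machine Java. The real Gradoop operator applies the same two predicates as distributed Flink filter transformations; all names here are illustrative:

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Single-machine sketch of the subgraph operator: keep a vertex if the
// vertex predicate holds, and keep an edge if the edge predicate holds
// AND both of its endpoint vertices were kept.
public class SubgraphSketch {
    public record Vertex(long id, String label) {}
    public record Edge(long sourceId, long targetId, String label) {}

    public static List<Edge> subgraphEdges(List<Vertex> vertices, List<Edge> edges,
                                           Predicate<Vertex> vertexPred,
                                           Predicate<Edge> edgePred) {
        Set<Long> keptIds = vertices.stream()
                                    .filter(vertexPred)
                                    .map(Vertex::id)
                                    .collect(Collectors.toSet());
        return edges.stream()
                    .filter(edgePred)
                    .filter(e -> keptIds.contains(e.sourceId())
                              && keptIds.contains(e.targetId()))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Vertex> vs = List.of(new Vertex(1, "green"),
                                  new Vertex(2, "green"),
                                  new Vertex(3, "blue"));
        List<Edge> es = List.of(new Edge(1, 2, "orange"),
                                new Edge(2, 3, "orange"));
        // only the orange edge between the two green vertices survives
        System.out.println(subgraphEdges(vs, es,
            v -> v.label().equals("green"),
            e -> e.label().equals("orange")).size()); // 1
    }
}
```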
And then we use this: we built a cost-based greedy optimizer that takes the query representation and the statistics and produces a query plan. In this query plan, every node represents a physical operator, which maps again to Flink transformations. And the output, if you execute this, is the collection of patterns found in the graph. And now Max will tell you a bit more about what's happening after this part, so after we have generated this plan and then execute it. Okay. Yes. So, as Martin said, after we have parsed and planned the query, we end up with something like this. This is the query plan for this particular query. Just to give you a short overview of what this actually means: this is actually a tree, even if it maybe doesn't look like it in this visualization. To execute it, we basically start with the leaf nodes. For example, we want to find all relationships that match this sub-pattern, so all relationships that have the type "hates". What we basically do is filter and project the collection of all relationships we have in our graph. After we've done this for all atomic patterns, we start building up larger patterns, and we do this by joining. For example, to get this sub-pattern, we join all nodes with the label "Orc" together with all relationships of type "hates". And so we continuously build up larger patterns by joining together smaller patterns. At the end, and this for example is the last operator in this plan, there is a filter step where we throw away all non-valid results according to a given predicate, which in this case makes sure that orc1 and orc2 aren't the same and clan1 and clan2 aren't the same. So we throw away all embeddings, that is, all results, for which this predicate doesn't hold. A slightly special operator is the expand embedding operator; this is a traversal operator.
With it, you don't only expand one hop; you can expand multiple hops using one operator. This is the star thing you see, or might not see, over here: in this case, we specify that we want to do at least one hop, but at most ten. Yes. And just to walk you through how this works internally and how it maps to Flink: every one of these operators produces results, intermediate results or of course the end result. So what we did is implement two main data structures. The first is the embedding. An embedding stores one particular intermediate result, or the end result. It has three types of entries. One is the list of ID entries; this stores the mapping of actual nodes and edges in our data graph to nodes and edges in the query graph. Then we have a list of properties. As you know, in the extended property graph model every node and edge (and even graphs, but that's not relevant here) has properties, like every hobbit has a name, Frodo has a name, and we store these in a separate list. And then we have paths. This is just an implementation detail to make sure that if we do this variable-length expand, we always end up with embeddings of the same size, with the same number of entries; it's not too important for now. These embeddings are kind of dumb: they only store things, but they don't know what they actually store. To make sure we have a mapping, we store this knowledge separately as well, as embedding metadata. The reason we split these things is that, since we are in a distributed environment, we eventually have to send data over the network. This happens usually when we do a join, and it's quite expensive compared to just accessing RAM. And so we have to make sure that the data structures we send over the network are as small as possible.
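The split between the two data structures can be sketched like this: embeddings hold only compact values, while one shared metadata object per query-plan node says which query variable sits at which position. This is illustrative Java with assumed names, not the actual Gradoop classes:

```java
import java.util.List;
import java.util.Map;

// Illustrative split between "dumb" embeddings and shared metadata:
// many embeddings flow over the network, but the column layout is
// stored only once per query-plan node. Names are assumptions for
// this sketch, not Gradoop's real implementation.
public class EmbeddingSketch {
    // one intermediate result: graph element ids plus projected properties
    public record Embedding(List<Long> idEntries, List<Object> propertyValues) {}

    // one per plan node: query variable -> position in idEntries
    public record EmbeddingMetaData(Map<String, Integer> columnOfVariable) {}

    public static Long idOf(Embedding e, EmbeddingMetaData meta, String variable) {
        return e.idEntries().get(meta.columnOfVariable().get(variable));
    }

    public static void main(String[] args) {
        // an embedding for the sub-pattern (orc1)-[h:hates]->(orc2)
        Embedding emb = new Embedding(List.of(7L, 13L, 8L), List.of());
        EmbeddingMetaData meta = new EmbeddingMetaData(
            Map.of("orc1", 0, "h", 1, "orc2", 2));
        System.out.println(idOf(emb, meta, "orc2")); // prints 8
    }
}
```

Every embedding produced by the same plan node shares one metadata object, so the per-record payload stays small.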
And since this metadata information is the same for every intermediate result at a given step, we can have just one metadata object per node in the query plan. It basically stores which variable in our query graph maps to which entry in the embedding, and the same goes for properties. So, to have some examples of how this actually works under the hood: we have three layers here. The top is our example layer, the actual data we have, our orcs. Then we have a more or less relational view, because what we do internally is, in a sense, map things to relations. And then you have the Flink view of how our physical operators in the end map to Flink operators. This is an example for a leaf node. We want to find all nodes that are a Hobbit and have the name Frodo Baggins. So we scan through all the nodes in the graph and filter out those which have the according label and the according name. What we end up with is a couple of nodes, or maybe just one, and all their properties. In the relational view, this would be a table with the ID and the properties of every hobbit named Frodo Baggins. As a follow-up step, we do a projection, because we again want to save some data size: we throw away all properties we don't need anymore, because why would we carry them if we never access them again? In this case, we are not really interested in any property of Frodo, so we don't keep any of them. In Flink, this maps to a flatMap function. FlatMap allows us to combine these two steps into one; this is an internal thing of Flink why that's better. We could also do a filter and a follow-up map step, but this would cost us more time. FlatMap basically means: as input you have one element, in this case a vertex, and you can output zero, one, or even more elements; you can output a variable number of results.
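The leaf step, filtering by label and property and then projecting away unneeded data, can be fused into one pass the way a flatMap fuses it. Here is a single-machine sketch with illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a leaf operator: one pass over all vertices that both
// filters (label + property predicate) and projects (keep only the id,
// dropping properties we never read again). In Flink this fits one
// flatMap: zero or one output per input element.
public class LeafScanSketch {
    public record Vertex(long id, String label, Map<String, Object> props) {}

    public static List<Long> scan(List<Vertex> vertices,
                                  String label, String key, Object value) {
        List<Long> out = new ArrayList<>();
        for (Vertex v : vertices) {
            if (v.label().equals(label) && value.equals(v.props().get(key))) {
                out.add(v.id()); // projection: id only, properties dropped
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Vertex> vs = List.of(
            new Vertex(1, "Hobbit", Map.of("name", "Frodo Baggins")),
            new Vertex(2, "Hobbit", Map.of("name", "Sam")),
            new Vertex(3, "Orc", Map.of("name", "Frodo Baggins")));
        System.out.println(scan(vs, "Hobbit", "name", "Frodo Baggins")); // [1]
    }
}
```

Doing filter and project in one operator avoids materializing the unfiltered, unprojected intermediate result.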
So, this is a leaf node operator that filters and projects embeddings; it is always a start operator in a query plan. Then we have the maybe most important operator, the join. This joins sub-patterns into larger patterns. For example, you have the pattern where an orc is a member of a clan and the pattern where an orc hates another orc, and you join these together into a larger pattern. In the relational view, this is the same as for relational databases: you have two tables, in our case two intermediate results, and you join them together. But we can't just join them blindly; we have some constraints, due to how graph pattern matching works. You basically have to check that you don't assign the same node or the same relationship to different variables in your query graph. So you don't want this orc and that orc to be the same, and you don't want this relationship and that relationship to be the same; in this case they could never be the same, but in other cases they could. You don't want to traverse the same relationship twice, at least depending on your settings, but that's out of scope for this talk. So, we take those two sub-graphs, join them, and get one larger graph. You can see in the example that we don't store the shared node twice, because why would we? It would be a waste of space. And this maps, this time, to a flatJoin rather than a plain join, because, as I just said, some join results wouldn't be valid under the isomorphism assumption, the node and relationship uniqueness. So, the maybe most interesting operator is the variable-length expand. In our example this would be: we have an orc, and we want to find Frodo Baggins within one to ten steps. For this we use an iterative expand. Under the hood this is basically just an iterative join, with some optimizations which hopefully help to make it faster.
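The variable-length expand can be sketched as exactly that, an iterative join: start from the source vertex, join one more edge per round, enforce the uniqueness constraint by rejecting vertices already on the path, and emit every path whose length falls in the requested range. Single-machine sketch with illustrative names, standing in for the distributed bulk iteration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the iterative variable-length expand: each round joins the
// current frontier of paths with the edge list (one more hop), skipping
// vertices already on the path (the uniqueness/isomorphism check), and
// collects every path with minHops <= length <= maxHops.
public class ExpandSketch {
    public record Edge(long source, long target) {}

    public static List<List<Long>> expand(long start, List<Edge> edges,
                                          int minHops, int maxHops) {
        List<List<Long>> results = new ArrayList<>();
        List<List<Long>> frontier = new ArrayList<>();
        frontier.add(List.of(start));
        for (int hop = 1; hop <= maxHops; hop++) {   // bulk-iteration analogue
            List<List<Long>> next = new ArrayList<>();
            for (List<Long> path : frontier) {
                long tail = path.get(path.size() - 1);
                for (Edge e : edges) {
                    // join step plus uniqueness check on the target vertex
                    if (e.source() == tail && !path.contains(e.target())) {
                        List<Long> grown = new ArrayList<>(path);
                        grown.add(e.target());
                        next.add(grown);
                        if (hop >= minHops) results.add(grown); // emit intermediates too
                    }
                }
            }
            frontier = next;
        }
        return results;
    }

    public static void main(String[] args) {
        List<Edge> edges = List.of(new Edge(1, 2), new Edge(2, 3), new Edge(3, 1));
        // from vertex 1: paths [1, 2] and [1, 2, 3]; the cycle back to 1 is pruned
        System.out.println(expand(1, edges, 1, 10));
    }
}
```

Note that all intermediate path lengths within the range are emitted, not only the longest one, matching the semantics described next.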
So, what we do is: we have the left side, which is only the orc, and then we have the set of relationships with their target nodes, and we join them iteratively, yielding every possible result, as long as it's at least one hop and at most ten hops. So it's not that we only yield the longest result; we yield all intermediate results as well. This is what Martin mentioned before: here we use the bulk iteration. This is kind of a cool thing for these processing systems; Spark, for example, doesn't have a native iteration operator, but Flink does, and so we use this iteration operator to iterate over the graph and produce our results. As a last step, we do the filtering. We had this predicate that orc1 and orc2 are not allowed to be the same, so obviously we have to filter the result. That's quite straightforward and maps to the filter operator which is already present in Flink. That's basically the set of operators we have implemented so far, which allows us to run most Cypher queries right now. What we don't have at the moment is the RETURN statement. This is work in progress, but we will eventually have it; there are some caveats on how to map the projected results back to a graph, which is why we don't have it right now. What we also don't have is aggregation. But Gradoop itself has operators for aggregation, so what you can do is run the pattern matching without aggregation and do the aggregation as a follow-up. Most of the rest of the Cypher standard we support right now. So, now that we have planned our query, this is what we end up with for Flink. This is the visualization of what Flink generates: we have our query plan and we translate it into a Flink query plan, and for this particular query, this would be the Flink plan. You don't see much here, and that's completely okay; we just wanted to show you that it's quite a lot of steps you have to
take in order to produce the result. And this is just a more detailed view: you basically have your data source, then you do a set of filter-and-project steps so you get all the vertices and edges you want to work with, and then you join them over and over, and in the end you end up with one data set which contains your results. Then you can, for example, write them into HDFS or wherever you want to store them. So, what we would like to do next: we want to integrate the openCypher Technology Compatibility Kit. This is a set of Cucumber tests which, if you implement them, tell you whether you have implemented openCypher correctly, hopefully. We haven't really done any broader checks yet on whether what we produce is correct, so this would be perfect. We want to do benchmarking: we have some preliminary evaluations, but we should go into more detail, and for this we want to use the LDBC Social Network Benchmark, which is a benchmark specifically designed for graph databases and graph systems. And we can do a couple of optimizations. Right now we use a greedy planner; we could use a dynamic programming planner, hopefully yielding better query plans and faster execution times. We want to improve our cost model: in order to produce a plan, you have to estimate how costly each step in the plan would be. Right now we have a very basic, very simple model based on probabilities, but this doesn't really scale if you have this number of joins. And we could reuse intermediate results: as you see, we have two orcs in our pattern, and for each of them we filter the list of all nodes in our graph. But that's actually the same work: since we only filter for the label "Orc", we get the same result on both sides, so we could basically reuse it, which we don't do yet. Another problem in a distributed environment is partitioning. If you have vertices with a high out-degree or in-degree, in a social network for example, they
increase the computing time on the particular worker which is in charge of handling that vertex, so they increase the computing time unevenly compared to every other node in the cluster. So we have data skew, which means that some workers take way more time to produce a result than other workers, and it's a hard problem to partition the nodes over the cluster evenly so that you end up with an even computing time. And, yeah, of course we want to support the rest of the Cypher features, and maybe even introduce more features, like regular path queries or Cypher support for the extended property graph model. And here's where you can come into the big picture: if you're interested in this and you think it's cool, it's open source, so you can work on this together with us. Just contribute if you like. Yes, thank you.