Yeah, hello everyone, thank you. I'm Andrew from the University of Leipzig, and today I'm going to talk about frequent subgraph mining on Apache Flink, which hides behind the title "From Shopping Baskets to Structural Patterns", because it's related to something some of you might know under the term shopping basket analysis, or association rules. First of all, a question: who of you regularly works with graphs? And with data mining? And with both graphs and data mining? Only a few, all right. So I'll slow down a little bit, and I'll try to keep the talk not too scientific and rather give you an introduction to the problem and the challenges of implementing it on Apache Flink, or in distributed dataflows in general.

The contents are as follows. First, I introduce the problem and explain the current state of these algorithms. Then I present the solution we built on top of Apache Flink in the context of the Gradoop framework. Another question: who attended the previous talk? Quite a lot, all right. So yes, this work is in the context of Gradoop, an open-source framework for distributed graph analytics. And finally, I'll talk a little bit about some optimizations of our solution, if I find the time, and of course conclude.

Part one is the problem. The general class of data mining problems here is called frequent pattern mining, and we are talking about the transactional setting. That means we have a collection of things: shopping baskets, click streams, XML documents, chemical compounds, whatever; it just depends on the data structure. We have a collection of, let's say, data records which might have different structures, and in frequent pattern mining we're interested in the patterns that occur, for example, in 80% of these things.
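To make that "occurs in 80% of the things" idea concrete in the simplest setting, here is a minimal frequent itemset miner in plain Python; this is purely an illustrative sketch with made-up basket contents, not part of our implementation:

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, min_support, max_size=3):
    """Naive level-wise frequent itemset mining: count every itemset of
    size 1..max_size and keep those contained in at least a min_support
    fraction of the baskets."""
    threshold = min_support * len(baskets)
    result = {}
    for k in range(1, max_size + 1):
        counts = Counter()
        for basket in baskets:
            counts.update(combinations(sorted(basket), k))
        frequent = {items: n for items, n in counts.items() if n >= threshold}
        if not frequent:   # no frequent k-itemsets: larger ones cannot exist
            break
        result.update(frequent)
    return result

baskets = [{"bread", "butter", "cheese"},
           {"bread", "butter", "wine"},
           {"bread", "milk"}]
print(frequent_itemsets(baskets, min_support=2/3))
```

With a min support of two thirds, only "bread", "butter", and the pair "bread and butter" survive; "cheese", "wine", and "milk" each occur in a single basket and are pruned.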
Frequent pattern mining can be categorized. The basic variant is frequent itemset mining, for example on shopping baskets, where we're just interested in which items frequently co-occur. If we add the constraint that the order of items matters, we get frequent sequence mining; for a click stream, for example, we're interested in which paths users regularly take through a website. If we additionally consider branching of paths, we arrive at frequent subtree mining, which is a little more difficult than sequence mining but still very closely related. And at the end, if we are also interested in all relationships, so that, unlike trees, our structures may contain cycles and the like, we have the problem of frequent subgraph mining. Chemical compounds are a typical example.

Let's first go back to the simplest variant of this problem. Some of you might know association rules; these are statements like "people who bought bread and butter often also bought cheese and wine." We can use this for recommendations, which is what some web shops do, I guess. For example, if you have already added two items to your shopping basket at Amazon, one way to recommend other things is using association rules, which are based on the results of frequent pattern mining over past data. With frequent subgraph mining we can do similar things. For example, if we know that a certain substructure often occurs in illegal stimulants, and we have an unknown chemical substance that contains this substructure, we can take a closer look at this substance; maybe it's a candidate for a new illegal stimulant. I don't know if this example makes perfect sense, but I hope you get the idea.

To conclude the introduction, the problem definition: our input is a graph collection. Our collection of things, in this case, consists of graphs.
We also have a min support threshold, which specifies the minimum percentage of the input graphs that have to contain a pattern for it to be considered frequent. And the output is the complete set of frequent subgraph patterns, i.e., those supported above the given min support.

To give you a toy example: on the bottom, and I hope it's visible for most of you now, there are three graphs. Such a collection is often called a graph database. It's small, but now more people can enjoy the fun of seeing the graphs. So this is the input, a graph collection. One thing I need to tell you: in our problem scenario, in the context of Gradoop, we are talking about directed multigraphs, which is a difference to typical solutions. Our edges have a direction, as in graph databases like Neo4j, and we support multiple edges between any pair of vertices, which changes the problem a little from the algorithmic point of view.

Now we have three graphs, and we're interested in subgraphs, which we can describe by patterns. A subgraph, just to not confuse you, is a part of a graph, and we're interested in those subgraphs which occur in at least 50% of the collection, so in two of three graphs. Typical algorithms use the abstraction of a frequent pattern lattice, which contains different levels. On the first level there are all zero-edge subgraphs, which means just vertices. We see we have a vertex A here, and A occurs in this graph, in this graph, and in that one: a frequency of three. The same for B, frequency of three. We have C with a frequency of one, because it is only contained in this one graph, and thus we can say: all right, it's not above the threshold of 50%, which would be two, and thus we can prune it. Now we continue and grow children from the surviving patterns. We have, for example, this A–B pattern, or this loop pattern.
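The level-zero counting and pruning just described can be sketched in a few lines of Python; the graph encoding here is my own toy simplification, not Gradoop's data model:

```python
from collections import Counter

# Toy graph collection: each graph is a list of vertex labels plus
# (source, target, label) edges, mirroring the slide's A/B/C example.
graphs = [
    (["A", "B"], [(0, 1, "x")]),
    (["A", "B"], [(0, 1, "x"), (1, 0, "y")]),
    (["A", "B", "C"], [(0, 2, "y")]),
]
min_support = 0.5
threshold = min_support * len(graphs)   # a pattern needs >= 2 supporting graphs

# Level 0 of the lattice: zero-edge patterns are just vertex labels.
# Each graph supports a label at most once, however often it occurs.
support = Counter()
for vertices, _ in graphs:
    support.update(set(vertices))

frequent = {label: count for label, count in support.items()
            if count >= threshold}
print(frequent)   # {'A': 3, 'B': 3} -- C has support 1 and is pruned
```

Only the frequent labels are then grown into one-edge children, exactly as in the lattice picture.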
These are the children of the zero-edge patterns, and we still see: okay, this one has a frequency of three, these of one and two. One is below the threshold, thus we can prune it. And finally, the largest frequent subgraph contained here is this two-edge subgraph, which is contained in two graphs, at level k=2. This is actually the desired result.

Note that we do not implement the lattice as a data structure; it's just the theoretical model behind these algorithms. From the data structure point of view, of course, we have graphs and we have patterns, which we need to store somehow. But the most important thing are embeddings; this is very similar to the previous talk about pattern matching. We need embeddings which map patterns to actual subgraphs. And in this case we see that, due to automorphisms, a graph might contain a pattern multiple times: this pattern is contained in this graph two times. One occurrence is the green line, so this vertex maps to this vertex and that one to that one, and because these vertices cannot be distinguished by the pattern, we get a second embedding.

So this is everything you need to know before I start explaining the algorithms. We have graphs, which are the input elements. We have subgraphs, which are parts of the input elements. We have patterns, which are nodes in our pattern lattice and are isomorphic to subgraphs; isomorphic means there is a one-to-one mapping between vertices and edges, and this is the tricky part of these algorithms, the same as in pattern matching. And we have embeddings, which are mappings between patterns and subgraphs.

The challenge in implementing such algorithms, especially in the context of shared-nothing clusters, is that we need to fit the dataflow programming model, in this case of Apache Flink, but this also applies to Apache Spark or actually even MapReduce. And what we need to find is an efficient algorithm which does not rely on shared memory.
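To make embeddings and automorphisms concrete, here is a brute-force embedding enumerator in Python; this is purely illustrative (real miners never try all vertex permutations like this), using the same toy tuple encoding as before:

```python
from itertools import permutations

def embeddings(pattern, graph):
    """Enumerate all embeddings of `pattern` into `graph` by brute force.
    Both are (vertex_labels, edges) pairs with edges as (src, tgt, label).
    An embedding maps each pattern vertex to a distinct graph vertex so
    that vertex labels match and every pattern edge exists in the graph."""
    p_verts, p_edges = pattern
    g_verts, g_edges = graph
    g_edge_set = set(g_edges)
    found = []
    for mapping in permutations(range(len(g_verts)), len(p_verts)):
        if any(p_verts[i] != g_verts[mapping[i]] for i in range(len(p_verts))):
            continue   # vertex labels do not match under this mapping
        if all((mapping[s], mapping[t], l) in g_edge_set for s, t, l in p_edges):
            found.append(mapping)
    return found

# A graph that contains the pattern A -x-> B twice, because its two
# B vertices cannot be distinguished by the pattern:
graph = (["A", "B", "B"], [(0, 1, "x"), (0, 2, "x")])
pattern = (["A", "B"], [(0, 1, "x")])
print(embeddings(pattern, graph))   # two embeddings: (0, 1) and (0, 2)
```

The two returned mappings are exactly the two embeddings from the slide: the same pattern, matched onto two indistinguishable subgraphs.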
I don't want to become too scientific now, so let me just explain what algorithm efficiency means here. The problem is that these algorithms contain the so-called subgraph isomorphism problem, which is NP-complete: we actually need to enumerate all isomorphisms between graphs, which is very expensive. And we cannot avoid this completely; if we want the complete set of frequent subgraphs, it's not possible to avoid this problem. But what efficient algorithms can do is minimize the isomorphism testing effort.

Most of this work has been done in the early 2000s, but for single machines, not in the context of distributed dataflow systems. First, there were Apriori-based approaches. These are based on a breadth-first search (BFS) in the lattice, which means we first process all frequent subgraphs of level one, then generate candidates, do pattern matching in the graph database, find the frequent ones, and join the frequent ones into children. Then we again do the pattern matching, and we repeat this until no frequent subgraphs are left. The good thing about these algorithms is that they only need one iteration per level, and they allow effective frequency pruning. But they have problems: they require subgraph isomorphism testing in the pattern matching, and additionally in the candidate generation. When you have two frequent patterns and want to generate a child, you need to enumerate all subgraphs of a certain level, which is also very expensive. You may generate candidates that don't even occur in the database, because they're just candidates; there's no guarantee for them to occur. And the candidate generation itself is hard to parallelize without shared memory, because you effectively need to build the cross product of all combinations, and that's just a lot of combinations.
This is why the more efficient single-machine approaches are based on a depth-first search (DFS) in the lattice. So this is the BFS and this is the DFS: they go, let's say, from this pattern, the root pattern, to this pattern, then to this one, until no more frequent patterns are found; then they jump back to a frequent parent, and so on. The idea behind these algorithms is that we can skip some of the links in the lattice due to specific tricks, which I will not explain now because that would really take a lot of time. But with these algorithms it's possible to avoid checking all the combinations. The idea is an edge-wise extension of frequent patterns, where growth rules reduce the search to a tree; the gSpan algorithm is a very popular example of this category.

Where isomorphism testing still occurs: sometimes we don't want to visit, for example, this edge or link in the lattice, but we visit it accidentally. This is why we need to check against the graph representation whether a pattern is really on the search tree we wanted to follow. If we look at these algorithms: they automatically compute canonical labels, which means we can compare graph patterns by a simple string comparison; isomorphism testing is reduced to a minimum, because we only need to validate these labels; and in the end we only check patterns that actually exist. So they solve a lot of the problems of the Apriori algorithms. But the downside is that we need a large number of iterations, and on a shared-nothing cluster with many, many workers this can lead to unbalanced parallelization, because we can maybe parallelize single paths, but not a whole level. The core idea of the newer dataflow systems is to bring the computation to the data, which means processing, for example, all embeddings of the same edge count at the same time, and this is not easily possible with the DFS approach.
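The canonical label idea can be illustrated with a deliberately naive Python sketch. Real algorithms like gSpan derive a canonical DFS code while growing the pattern; minimizing over all vertex orderings, as done here, only demonstrates why canonical labels are useful, namely that pattern comparison becomes a plain string comparison:

```python
from itertools import permutations

def canonical_label(vertices, edges):
    """Naive canonical label: the lexicographically smallest textual form
    over all vertex orderings. Isomorphic patterns (same structure, any
    vertex numbering) get the same label; different patterns do not."""
    best = None
    for order in permutations(range(len(vertices))):
        pos = {old: new for new, old in enumerate(order)}
        vpart = ";".join(vertices[old] for old in order)
        epart = ";".join(sorted(f"{pos[s]}-{l}>{pos[t]}" for s, t, l in edges))
        label = vpart + "|" + epart
        if best is None or label < best:
            best = label
    return best

# the same one-edge pattern written down with its vertices in two orders:
a = canonical_label(["A", "B"], [(0, 1, "x")])
b = canonical_label(["B", "A"], [(1, 0, "x")])
print(a == b)   # True: both orderings yield the same canonical label
```

Reversing the edge direction, by contrast, changes the label, so directed patterns stay distinguishable.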
Coming back to our toy example: this is why I used dotted lines for some of the links in the lattice. The Apriori approaches would follow all the links, or there are even some optimizations which may skip a few, but still, this is the basic difference: in a pattern growth approach, we can just leave out the dotted ones.

Okay, so now this is everything you need to know about frequent subgraph mining and the past research, and now we face this problem in the context of distributed dataflow systems on shared-nothing clusters. Our solution is something we call level-wise pattern growth, which means we do a parallel depth-first search in the lattice. We process the lattice level-wise, but still skip the links we do not want to visit, and still avoid the isomorphism problem caused by candidate generation. This means we only need k_max iterations, where k_max is the edge count of the largest frequent pattern, we can skip those links, and we still minimize isomorphism testing. And this approach fits the dataflow programming model very well. I will show you, using some pseudocode, how we can implement this on Apache Flink.

For all who don't know Apache Flink: the basic abstraction is that you have data sets, and then you have transformations on these data sets, where a higher-order function is a parameter; it's similar to lambda expressions, but for data sets. The idea is that the function is executed for each element of a data set, or for a group of elements. The transformations run completely concurrently, and anything that requires global knowledge exchange means I need to artificially create barriers in the dataflow to exchange that knowledge. That's the problem in the shared-nothing context.
What you would naively program, knowing these constraints: there would be a data set of graphs, a data set of all frequent patterns, and a data set of k-edge frequent patterns; those are all the data sets we need. Then we have a loop which terminates once we do not find any k-edge frequent patterns anymore, and at the end we're interested in all frequent patterns. That's what our algorithm should do.

What do we do in each round? First, a pattern growth step, in which we use something called broadcasting in Flink; it's like a very efficient cross product in a distributed dataflow. We just send all of the last iteration's k-edge frequent patterns to all machines, so we can access them in the pattern growth step. Initially this set is empty, which corresponds to the dummy root of the lattice. We apply pattern growth using a map function which updates the graphs, because the graph data set does not only contain the graphs themselves; it also contains a map between patterns and embeddings. In the pattern growth step we just check that map and grow all the keys which are also contained in the frequent pattern set.

In the next step, we report the grown patterns using a flat map. A flat map is like a map function, but with arbitrary output cardinality: it can output zero or many patterns. Then it becomes a little relational: we group by pattern, which is just a key, very similar to the reduce function in MapReduce. We group by key, sum the frequencies, keep only the frequent ones, and we additionally need to filter for valid ones, because sometimes there are false positives.
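The report → group-by-pattern → sum → filter part of each iteration can be sketched without Flink. This is plain Python over a list of tuples, with hypothetical pattern strings standing in for canonical labels:

```python
from collections import defaultdict

# One iteration's "report" output, as the flat map would emit it:
# (graph_id, pattern) pairs; the pattern strings here are made up.
reports = [
    (0, "A-x>B"), (0, "A-y>B"),
    (1, "A-x>B"),
    (2, "A-x>B"), (2, "C-y>A"),
]
graph_count, min_support = 3, 0.5

# groupBy(pattern) + sum: support = number of distinct supporting graphs
supporting = defaultdict(set)
for graph_id, pattern in reports:
    supporting[pattern].add(graph_id)

# filter: keep only patterns at or above the min support threshold
frequent = {p: len(ids) for p, ids in supporting.items()
            if len(ids) >= min_support * graph_count}
print(frequent)   # {'A-x>B': 3}
```

In the real dataflow this grouping is what forces a barrier, because computing a pattern's support needs reports from all machines.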
We found in our evaluations that it's far more efficient to group first and then filter invalid patterns, instead of validating in the growth step: there, we would have to validate for each graph and each pattern, while here we only validate once per pattern in the whole system. Imagine an input of ten million graphs: we do not repeat the same validation ten million times, but do it here, where, compared to the complexity of counting over such a large data set, it's basically free. Finally, we add the current round's frequent patterns to the all-frequent-patterns data set using a union operator. That's how one would naively program it.

But the problem is Flink's bulk iteration. It's a dataflow iteration, not a typical while loop; it belongs to the batch API of Apache Flink, and it allows exactly one data set to be passed along iterations. Here we actually have three data sets which we manipulate during our iterations, and there is the constraint that from the iteration body we cannot access the other two, so this doesn't work. For the first one, the k-edge frequent patterns, there is a quite simple solution: we pull it out of the iteration, restructure a little bit, and just instantiate the set of k-edge frequent patterns inside the loop, so we avoid the external access. But this does not work for the union operation and our data set collecting the frequent patterns of all iterations.

This is where we use a very hacky approach. At the beginning of the dataflow, we add to the graph data set a new element which is actually an empty graph with an empty pattern-embedding map, and because we ensure it's the only empty graph in the whole data set, we can identify it. Then we extend the map function: we do not only apply pattern growth, but pattern growth or store.
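A plain-Python sketch of this "pattern growth or store" idea; this simulates the single-data-set constraint without any Flink operators, and the graph encoding and pattern string are my own illustrative choices:

```python
# Flink's bulk iteration carries exactly one data set, so the accumulated
# frequent patterns ride along inside a single empty "dummy" graph instead
# of a second data set.
def make_graph(vertices, edges):
    return {"vertices": vertices, "edges": edges, "patterns": {}}

dataset = [
    make_graph(["A", "B"], [(0, 1, "x")]),
    make_graph(["A", "B"], [(0, 1, "x")]),
    make_graph([], []),                     # the one empty dummy graph
]

def grow_or_store(graph, frequent_patterns):
    if not graph["vertices"]:               # dummy graph: store, don't grow
        graph["patterns"].update(frequent_patterns)
    # a real graph would apply pattern growth to its embeddings here
    return graph

# one iteration, with a hypothetical frequent pattern of this round:
dataset = [grow_or_store(g, {"A-x>B": 2}) for g in dataset]

# post-processing: the dummy is the only empty graph, so we can find it
# and emit its collected patterns together with their frequencies
collected = next(g["patterns"] for g in dataset if not g["vertices"])
print(collected)   # {'A-x>B': 2}
```

The dummy element thus reuses the pattern-embedding map every graph already carries, which is exactly why no additional data structure is needed.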
We just check: if the graph is empty, we use its pattern-embedding map to store the frequent patterns. Because we can reuse the data structure we already have, we don't need any additional one. At the end, we filter out this single dummy graph, which was collecting all the frequent patterns, and use a flat map which creates a new pattern element in the output data set, including its frequency, which we can also encode in the mappings; and we get the same result. This really works; that was the solution to our problem.

Okay. So this was the basic idea; I hope you understand the problem and how we implemented this on a distributed dataflow system like Apache Flink. Then we applied some optimizations. Actually there are four of them, and because two of them are very difficult to explain without becoming too scientific, I'll skip those. One is that it's even more efficient to do the validation step in a combine function, because it matters to have fewer tuples before network traffic is caused; to everyone programming with distributed dataflow systems this seems obvious. And we use a merge join instead of a cross during the pattern growth process, but this would need a lot of explanation now. What I will explain: we do a preprocessing step where we use dictionary encoding and label pruning to reduce the input size, and the most effective thing is a pattern encoding with a fast and effective compression technique. The preprocessing idea is basically that we keep only vertices with frequent labels.
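The dictionary encoding just mentioned can be sketched like this; the labels are illustrative, and in the real preprocessing the dictionaries are built from the frequency counts rather than in input order:

```python
# Dictionary encoding sketch: frequent string labels become small
# integers before mining and are translated back for the output patterns.
labels = ["Person", "knows", "Person", "City", "livesIn"]

dictionary = {}
for label in labels:
    dictionary.setdefault(label, len(dictionary))   # first occurrence wins
reverse = {code: label for label, code in dictionary.items()}

encoded = [dictionary[l] for l in labels]
decoded = [reverse[c] for c in encoded]

print(encoded)   # [0, 1, 0, 2, 3]
assert decoded == labels   # decoding restores the original strings
```

Comparisons, hash codes, and grouping keys then operate on small integers instead of strings, which is where the speed-up comes from.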
When we remove those vertices, we also remove some edges; then we consider only the remaining edges, count their label frequencies as well, and also drop all edges with infrequent labels. Based on these steps we have the frequencies of all labels, so we can just use them to create dictionaries between strings and integers. Because frequent subgraph mining involves a lot of comparisons, equality checks, hash code computations, and so on, it's much more efficient to use integer labels instead of strings. The workflow is then as follows: we first encode the graphs, we run the actual mining, and we decode the patterns, because in many scenarios we do not want patterns consisting of numbers; we really want our patterns back with their labels, RDF patterns for example.

To illustrate this approach: we see here a vertex C which is only contained in a single graph, and here an edge label b which, if we only consider the edge label, is still frequent, because it occurs here and here. But if we first remove this C vertex, then this b edge, to preserve consistency, goes away too, and b is not frequent anymore. If we do this in preprocessing, such patterns will never even be generated.

The second optimization technique is the pattern encoding and compression. We do not store the graphs in the typical way you would natively program graphs, using Java objects and so on, because we found that avoiding Java objects is one of the most effective tuning techniques in Apache Flink, at least for this scenario. Instead, we use multiplexed integer arrays to represent patterns, graphs, and embeddings; we represent everything in multiplexed integer arrays which need six positions in the array for each edge. And we know that we only have positive values, and we have very low upper bounds on the values: the upper bound is the edge count or the size of
the label dictionaries, and typically we do not exceed 4 to 8 bits per value. I say typically; of course you can create data sets where it's different, but in most use cases that's enough. This is much less than the 32 bits usually required for a single integer, so we can use integer compression, and in our data sets we can roughly encode a complete edge in 1 integer instead of 6. We use Simple16 from a nice open-source package for this. The effect is that we decrease network traffic, because the patterns travelling over the network are smaller; we have a smaller grouping key in the group-by operation, which is notably faster; and the map access, when we access the embeddings of our patterns, benefits from much smaller map keys, so this is more efficient too. Additionally, we experimented with graph and embedding compression, mainly to decrease memory usage and to allow faster serialization between transformations; this is also something we have learned really matters in Apache Flink. We execute the decompression only on demand; for example, embeddings of infrequent patterns will never be decompressed. So there are a lot of detailed optimizations.

Then we did some evaluations. I will not talk about absolute runtimes, because they don't mean much without background knowledge about the data sets. What we did in these experiments: we have two data sets. One is synthetic, with one million graphs and 90% support; it already yields a lot of patterns, and it contains directed multigraphs. The other is the classical scenario used by all other work in the field of frequent subgraph mining, a molecular data set with undirected simple graphs, one million graphs, and 10% support, so quite a low support, and it's one million molecules. With other types of data you can have big data problems with larger input
sizes, but here the problem is the intermediate results and the computational complexity. We scaled up from 6 to 96 threads, which is roughly the same as 1 to 16 machines in a cluster, and we see that especially on the synthetic data set, for which we optimized our algorithm for directed multigraphs, it shows quite a good scale-up; and even for the molecular database it's not bad. It's not linear, but it's still quite good.

Then we also evaluated the impact of our optimizations. It depends on the data set: in this synthetic data set we deliberately created many, many infrequent labels, so of course the pruning technique is very efficient there, but there may well be real-world data sets, such as RDF for example, which have exactly the characteristics we considered in the design of this synthetic data set. We see that, over all parallelism levels from 6 to 96 parallel slots, the dictionary encoding in this case can speed up the total runtime by up to 400%. Then we left out the pattern compression, so we only used the integer representation, which is already something like 10 times faster than using strings; this brings a notable effect, especially on the molecular database. The embedding compression still has a notable impact, probably due to the serialization time, but graph compression is only a small tuning effect.

All right, conclusion. What we did is the first approach to graph collection frequent subgraph mining using an in-memory dataflow system; there are MapReduce approaches, but in this kind of system, to the best of our knowledge, we are the first. We support directed multigraphs from arbitrary string-labeled input: you can just feed in a string-labeled input file and you get string-labeled patterns, without any external preprocessing. It works, but using Flink requires some workarounds, as you have seen, and we still think it's a much better choice than MapReduce, because in each iteration you have to update three data sets, and this also holds
for MapReduce: you also need workarounds there to solve these problems, but the MapReduce workarounds are by far more expensive than the workarounds you need for the Flink-based approach. Regarding availability: it's part of the Gradoop framework, the distributed framework for graph processing, and it fits its algorithm API. The current version you will find in the master branch is not optimized; it's still using Java objects, strings, and so on, so it's not very efficient. The optimized version is still a messy research prototype, but it will be merged into master soon. Thank you.

Q: Did you try this on a real-world data set, to evaluate this, maybe with a business partner? I mean on real purchase data.
A: No. The problem is the availability of such data sets. That's why we have a data generator which creates this type of data, and why we use the synthetic data set, which contains loops, parallel edges, and many automorphisms.
Q: I happen to know that MSD in Prague has a data set like that.
A: Excuse me?
Q: I happen to know that MSD in Prague has a data set like that.
A: Yeah?
Q: Yes, they're working on it.
A: What is in there?
Q: Molecular data.
A: Yeah, but molecular data, of course; there are tons of molecular data sets. The thing is that in molecular data sets you have only two types of edges, single bond and double bond, you have a very limited set of vertex labels, and there is a very low number of specific patterns that occur, like some cycles. So of course you can tune the algorithm to molecular databases, but we wanted to make a general approach fitting all data sets, and because we have no data set...