My name is Gábor Szárnyas. I'm a PhD candidate from Budapest, Hungary, and today I'm here to talk about graph analytics, in the context of using matrix operations to run analytics on investigative journalism datasets.

If you're here, you have probably heard about the Paradise Papers. This is a dataset that was leaked to a group of investigative journalists. It is somewhat similar to the Panama Papers, but it's a bit more complex and a bit bigger. In big data terms it's not a huge dataset, it has two million nodes and three million edges, but even so, it has been used with success to uncover various interesting and shady offshore deals. So this is actually one of those use cases where graphs are used for the greater good: it can help us understand how the world works and reduce corruption. It is really one of the coolest use cases of graph technology.

The schema of the graph is rather simple. You have entities, which are connected to officers; they have intermediaries to do their business deals with; and many of these objects can be registered to a certain address. So this is a graph schema like ones you have probably seen before: there are node labels and there are edge types.

When we talk about graph data models, I think it's important to get the terminology right. If you open a textbook, you often see a graph like this: a social graph with five people, from Bob to Fred, and these people have some acquaintance relationship. Each node is a person, and each edge is a relationship. This is called a single graph, or sometimes a textbook graph, which is a bit derogatory, and it has many other names: if you review the literature, it's called an untyped graph, a homogeneous network, or a monoplex graph. These are all names for the same concept.

Of course, reality is a bit more sophisticated than that. Keeping the simplification that we only have persons in the graph, let's see whether they have different kinds of relationships. In this figure, the green lines correspond to business relationships and the blue lines correspond to friendships. This makes the graph a lot more interesting, because we can see that Carol is more oriented towards the friendship side, while another person is more like the business centre of this little set of persons. In the literature this is called a multiplex graph, or an edge-labelled or edge-typed graph, and in network science it has other names such as heterogeneous network or multi-dimensional network. Interestingly, the expressive power of this graph is a lot stronger than that of the untyped graph, but it is still miles away from the property graph served by graph databases, because we don't assign any properties to the nodes or the edges.

So what do we mean by graph analysis, and multiplex graph analysis in particular? When we do graph analysis, we often define a set of metrics, like the so-called local clustering coefficient. The idea here is that we study the graph at each node: we start from each node v and look for instances of wedges, which is this shape, meaning a node with two neighbours that might or might not be connected to each other. If those two neighbours are connected to each other, they form a triangle, shown on the right, so a wedge can be closed into a triangle.
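Written as a formula, the local clustering coefficient of a node v is the fraction of wedges centred on v that close into a triangle. This is the standard textbook notation, added here for clarity, and it assumes an undirected graph without self-loops:

$$\mathrm{LCC}(v) \;=\; \frac{\#\{\text{wedges centred on } v \text{ that close into a triangle}\}}{\#\{\text{wedges centred on } v\}} \;=\; \frac{2\,\delta(v)}{d(v)\,\bigl(d(v)-1\bigr)},$$

where $\delta(v)$ is the number of triangles containing $v$ and $d(v)$ is its degree.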
This is an age-old idea dating back, I guess, to the 1960s, and it was popularised in textbooks in the 1990s; it is a very popular tool in social network analysis. To determine this metric, the clustering coefficient, the idea is simply to calculate the ratio of the triangles, that is, the wedges that close into a triangle, to the total number of wedges, and it has been found to be a very efficient way to characterise social networks.

However, there is more to this story, because we can apply the ideas of multiplex networks to the clustering coefficient. We call the result the typed clustering coefficient metric. We have taken this from a paper that was published five years ago in a physics journal, and the idea is actually pretty simple. We look for wedges whose two edges have the same type, and if they close into a triangle, the third edge must be of a different type. When we calculate the TCC metric, the typed clustering coefficient, we take the ratio of typed triangles to the number of typed wedges.

Why is this useful? If we take a look at the example and just calculate the simple LCC metric, we can see that it's two-thirds for all of the nodes. It doesn't really give any insight into the story: all people are equal according to this metric. However, if we take the type information into account, we can see that some people are more influential or more prominent in certain aspects of clusteredness. D has 0.5, which is a reasonable value for someone in that position, and interestingly C has an even higher value. This is something that can help us start our analysis.

Previously, back in 2016, I published a paper with my advisor and my peers about the characterisation of engineering models. When I say engineering models, you should really just think of graphs that describe a software or hardware component, a building, or something else designed by an engineer. These models have about 100,000 or at most a million nodes and edges. Back then, we implemented a Java application that uses edge lists to calculate metrics, including the typed clustering coefficient and the local clustering coefficient. We were very confident, so we thought we could just run these analyser components on the Paradise Papers dataset and it would work out of the box. Then it turned out that it didn't: it ran for multiple days and still didn't finish the calculations. This got us thinking: what's going on? Why doesn't it work for a dataset the size of the Paradise Papers, which isn't really big? It turned out that there are many implementation issues, systems aspects that we need to get right, and there are also algorithmic aspects that need to be studied.

Just to take one step back: which part of the graph processing landscape are we in? I'm going to use the terminology of the Linked Data Benchmark Council, abbreviated as LDBC, which releases benchmarks for all sorts of graph-specific workloads. According to them, there are two axes: the amount of data accessed and the expected execution time of a given query or analytical workload. We start at the bottom with OLTP workloads, simple reads and simple lookups, then go up to OLAP queries, which are more analytical, use more aggregation, and do global filtering and multiple joins, and at the top is graph analytics, the classical network science approach to graph analysis.
If you have studied graph algorithms, you have probably heard about PageRank, connected components, or shortest path algorithms; graph analytics does just that. It is basically a structure-only analysis of a given graph. There are many tools for this, including a lot of Apache projects such as Giraph, Spark, and Flink's Gelly, plus the Neo4j graph algorithms library, and so on. So this is pretty well covered. The bottom half is also pretty well covered, because if you perform graph queries that use the structure of the graph, the types of the graph nodes, and the properties of those nodes, you can always use languages like Cypher and SPARQL, and there are many tools for this, including Neo4j, Virtuoso, Stardog, etc.

But interestingly, our work fits somewhere in between: it uses the structure of the graph and some type information, but it doesn't use any of the properties. It's sort of in between all these workloads, and it seems to me that there are no off-the-shelf solutions to tackle this sort of computation.

If you're familiar with the Apache tools, you might think that some of them would work for this, and this is indeed a good question. There are four popular Apache frameworks for graph analysis: Hama, Giraph, which runs on top of Hadoop, Spark's GraphX, and Flink's Gelly. Interestingly though, I found that these libraries are getting more and more abandoned. I did a little bit of analysis and counted the commits for the last two years and for the two years before that period. It turned out that it's pretty much a downward trend: Hama is essentially a dead technology, Giraph has been abandoned by most of its developers, GraphX is losing momentum, and the only one that is more or less active is Flink's Gelly, but overall it's pretty much a downward trend, unfortunately. A couple of years ago there was another technology here, the Arabesque distributed graph mining framework. According to its description, it does open source graph analytics over the Hadoop ecosystem, and its frequent subgraph mining is very similar to our workloads. So this could have been a very good fit, but unfortunately it is also getting abandoned: even though there are papers, repositories and presentations on it, it doesn't seem to be actively developed.

And at this point we were thinking, well, hang on a minute, what's so difficult about graph computations? A bit of reading reveals that graph computation is inherently a lot more difficult than the relational queries and relational analytics performed by popular data processing frameworks like Spark DataFrames, because graphs suffer from what's called the curse of connectedness. This means that everything is so densely connected that it's very difficult to parallelise the algorithms, very difficult to do efficient caching, and in general very difficult to extract high computational value from the machine. The reason might be that contemporary computer architectures are better suited to hierarchical data, think trees, lists, stacks; this works pretty well for relational databases, but it's not well suited for graph algorithms. One approach that tries to overcome this is to define programming models. The vertex-centric programming model, also known as the think-like-a-vertex programming model, is a popular approach that tries to describe graph algorithms at a higher level of abstraction and then translate them to an underlying engine.
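To make the idea tangible, here is a toy sketch of a vertex-centric computation in Julia. The function name bfs_levels, the little graph, and the message format are all made up for illustration; real systems like Giraph or Gelly have their own APIs. In every superstep, each vertex combines the messages it received, updates its own value, and sends new messages to its neighbours; here the vertex program computes BFS levels from a source node.

```julia
# Toy "think like a vertex" sketch: synchronous supersteps with message passing.
function bfs_levels(neighbours::Vector{Vector{Int}}, source::Int)
    n = length(neighbours)
    level = fill(typemax(Int), n)       # vertex state: BFS level, "infinity" = unreached
    inbox = [Int[] for _ in 1:n]        # messages received in the previous superstep
    push!(inbox[source], 0)             # the source receives level 0 before the first step
    for superstep in 1:n                # at most n supersteps are needed
        outbox = [Int[] for _ in 1:n]
        for v in 1:n
            isempty(inbox[v]) && continue      # inactive vertex: no messages received
            m = minimum(inbox[v])              # combine incoming messages
            if m < level[v]
                level[v] = m                   # update own state
                for u in neighbours[v]
                    push!(outbox[u], m + 1)    # send messages to neighbours
                end
            end
        end
        inbox = outbox
    end
    return level
end

neighbours = [[2, 3], [1, 3, 4], [1, 2, 4], [2, 3]]
bfs_levels(neighbours, 1)   # == [0, 1, 1, 2]
```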
The majority of distributed engines use this vertex-centric model, but we were thinking that for such a small dataset we should be able to do this on a single machine, preferably with a couple of lines of code, so we investigated a bit further. This is how we encountered the linear algebra-based formulation of graph computations.

The idea, I think, is well known to most of you here. If you have a graph, you can trivially transform it into an adjacency matrix. You take a square matrix of size n, where n is the number of nodes, and if there is an edge between two nodes, you put a one in that cell; if there isn't an edge, you put a zero. This is a trivial idea, and it's also very simple to adapt to typed graphs, because you only have to slice the matrix into sub-matrices, one for each edge type. The key observation is that if you take such a matrix, you can do matrix multiplications, and those will enumerate paths of various lengths. If you do one matrix multiplication, you get the two-hop paths; if you do two multiplications, you get the three-hop paths, and so on.

So this was one of the first things we realised. We tried it for the untyped case, which turned out to be fairly simple: we just took A times A times A and took the diagonal elements of the resulting matrix, because if a three-hop path closes back into its starting node, you have a triangle. This is how we enumerated the triangles, and it turned out to be fairly straightforward to adapt the idea to the typed case: we did a multiplication between one type and the other, and the first type again, took the diagonal elements, and we got the number of typed triangles at each node. The number-of-wedges part is easy, so this took care of the more difficult part.

Unfortunately, it isn't that easy, because the matrices get very dense: if you start enumerating all the paths in a matrix, it becomes dense very quickly. So one of the real optimisations that we could apply is to stop at enumerating two-length paths and then try to close them into a triangle without actually enumerating all of the three-length paths. The idea is that instead of computing A times A times A, we compute A times A, element-wise multiplied by A, and then we sum the rows. With a bit of tinkering this can be adapted to the typed case, but it is still quite a challenge, because for the typed case we have to calculate these expressions for all type pairs. I had a master's student who worked on this last year, and her master's thesis contains a detailed description of all the derivation rules. We defined two optimisation steps: the first is the element-wise multiplication, and the second is the observation that even though we have to do a quadratic number of matrix multiplications, this is an embarrassingly parallel problem, because we can run them on different CPU cores and there is no need for a shared, mutable representation. So it is pretty easy to parallelise. In practice it's quite straightforward to represent: we do A times A, element-wise multiplied by the other matrix, and it's easy to see that the result is a much sparser matrix with only a few non-zero elements. Then we multiply it with a vector of ones, which is a shorthand for the row-wise summation of the matrix, and we're done.
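As a minimal sketch of this formulation, here is how the untyped computation can be written in Julia with sparse matrices. The example graph and all names are made up for illustration; this is not the code of our graph-analyzer library, just the same idea written out.

```julia
using SparseArrays, LinearAlgebra

# Small undirected example graph given as an edge list (1-based node ids).
src = [1, 1, 2, 2, 3]
dst = [2, 3, 3, 4, 4]
n   = 4

# Symmetric 0/1 adjacency matrix.
A = sparse(vcat(src, dst), vcat(dst, src), ones(Int, 2 * length(src)), n, n)

# Naive version: diag(A^3) counts, for each node, the three-hop paths that
# return to their start node; every triangle at a node is counted twice.
triangles_naive = diag(A * A * A) .÷ 2

# Optimised version: (A*A) .* A keeps only the wedges whose endpoints are
# themselves connected, i.e. the wedges that close into a triangle.
# Multiplying by a vector of ones sums each row.
triangles = (((A * A) .* A) * ones(Int, n)) .÷ 2

# Wedges per node: d*(d-1)/2, where d is the node degree.
d      = A * ones(Int, n)
wedges = d .* (d .- 1) .÷ 2

lcc = triangles ./ max.(wedges, 1)   # guard against division by zero

# For the typed case the same pattern is applied per pair of edge types, e.g.
# with adjacency matrices A1 and A2 for two edge types, one combination is
#   ((A1 * A1) .* A2) * ones(Int, n)
# which counts wedges built from type-1 edges that are closed by a type-2 edge.
```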
So much for the theory. We thought that implementing this in Java would be trivial: just go online, grab one of the Java libraries, and implement it in a few hours. It turned out not to be that simple, because most Java libraries are geared towards dense matrices and dense linear algebra, and very few actually support sparse matrices. I asked about this on the Software Recommendations Stack Exchange; my question started its life in early August, got a number of edits and some upvotes, and I ended up answering it myself. Basically, the answer is that the only Java library usable for sparse linear algebra is called EJML, which is an apt name, because it stands for Efficient Java Matrix Library. It is still actively maintained, and we found that it is the one that can actually do the computations for our relatively small 2-million-node graph.

These are some of the benchmark results that we obtained. The red line is our original edge list representation. As you can see, it starts out promising at smaller scales; these are samples of the Paradise Papers dataset, scaled down to 5,000 nodes, 10,000 nodes, and so on. At the small scales it was all fine, but once we went a bit larger, all the tools kept breaking with the exception of EJML. It was the only one that worked for the whole 2-million-node Paradise Papers dataset, and it actually provided decent performance: it took less than a minute to load the data and calculate these typed clustering coefficients. Remember, this is a calculation that didn't complete in multiple days when we started this work, so this is four or five orders of magnitude of improvement.

We have released the so-called graph-analyzer library, which contains our linear algebra-based implementations of a number of multiplex graph metrics. It uses EJML with a compressed representation for storing the sparse matrices. Unfortunately, it's single-threaded for most metrics, but it uses some parallelism if there are multiple types. You can just plug it in: it can work on top of a Neo4j instance, on top of the Eclipse Modeling Framework that defines engineering models, and it also works with regular CSV files. The unique feature of this library is not its performance, but that it supports a number of multiplex graph metrics: metrics defined for nodes, for types, for node pairs, node-type pairs, type pairs, and so on. This is quite good coverage of the metrics found in the last ten years of the network science literature.

Okay, so obviously we had trouble with the Java implementation, and while looking for a Java library that could handle sparse matrices, we stumbled upon a new language called Julia. This is a very recent development: version 1.0, the first stable release, came out last August, and version 1.1 is just out this year. The idea of Julia is to solve the two-language problem. Traditionally, in high-performance computing, people started implementing algorithms in a productivity language like Python or R, and once they got the algorithm right, they rewrote it in C or C++ or some other low-level code. The promise of Julia is to be a single language that provides both high performance and a high level of abstraction. It's also a dynamic language, so it feels similar to MATLAB, or even somewhat Fortran-esque if you have used those languages.
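To give a feel for how little code the per-type slicing needs in a language like this, here is a hypothetical Julia sketch that loads a typed edge list from a CSV file into one sparse adjacency matrix per edge type. The file name and column layout are made up; the actual input format of our library may differ.

```julia
using SparseArrays, DelimitedFiles

raw = readdlm("edges.csv", ',')      # assumed columns: source id, target id, edge type
src = Int.(raw[:, 1])
dst = Int.(raw[:, 2])
typ = String.(raw[:, 3])

n = max(maximum(src), maximum(dst))

# One symmetric 0/1 adjacency matrix per edge type
# (assumes each undirected edge appears once in the file).
adjacency = Dict{String, SparseMatrixCSC{Int, Int}}()
for t in unique(typ)
    keep = typ .== t
    s, d = src[keep], dst[keep]
    adjacency[t] = sparse(vcat(s, d), vcat(d, s), ones(Int, 2 * length(s)), n, n)
end
```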
We managed to write an implementation of our TCC problem in Julia in a few days, which included learning the language and all its scaffolding. It was 25 lines in total, and it's pretty easy to comprehend and easy to update. This is something we found really interesting: we could just rewrite the code in a MATLAB/Octave-like syntax, and its performance was comparable to the best Java implementation, but it was easier to read, easier to write, and probably easier to extend.

The third implementation path is something called GraphBLAS. Now, GraphBLAS isn't actually an implementation; it is an effort to define standard building blocks for graph algorithms in the language of linear algebra. The idea comes from BLAS, which is about 40 years old and is the abbreviation of Basic Linear Algebra Subprograms. The idea there was that we have a lot of different hardware, and BLAS acts as an interface between those hardware components and the applications: it provides a separation of concerns between the numerical applications that need to do heavy number crunching and the hardware that performs the evaluation. GraphBLAS is basically the same idea rehashed for graphs: graph analytical applications implement graph algorithms against this common API, and all the hardware vendors can optimise their implementations to suit this API as well as possible. And, you know, things have changed since 1979: now we have various Xeon Phis, we have GPUs, we have FPGAs. So this is a very interesting initiative to try to reconcile all those heterogeneous hardware components under a single API and then use them to perform high-performance graph analytics.

One of the key ideas of GraphBLAS is to build on the notion of a semiring. It has been observed that if you take a semiring with a multiplication and an addition operation and redefine those operations, it's very easy to capture a couple of graph algorithms. This is actually another old idea: it was described in the Aho-Hopcroft-Ullman book from 1974, and also in the algorithms book most of you have probably read, the Introduction to Algorithms, the CLR book. Interestingly, it's only in the first edition, because they decided to remove it later. The idea is very simple: we take adjacency matrices and redefine the operators over the semiring, and it turns out to be very easy to translate one algorithm from each family of algorithms into this form. If you think of the Floyd-Warshall algorithm, that fits along these lines, and single-source shortest path or BFS can also be adapted to this notion.

The problem is that even though the idea is old, very few libraries support it. There is no Java library that I am aware of that actually allows users to redefine these operators, and even in other languages like C or C++, if you just take a BLAS implementation, it will not support this idea: it only works for real numbers, the classical semiring, which won't cut it for paths and traversals. If you're looking for an implementation that does work, there is a suite called SuiteSparse:GraphBLAS. It offers a C API and a single-threaded implementation. It was developed by Timothy Davis in Texas, and it is used in the RedisGraph system; it is actually the implementation that underlies RedisGraph.
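To illustrate the semiring idea, here is a tiny Julia sketch of my own; the function name minplus_step and the toy graph are made up, and this is not GraphBLAS or SuiteSparse code. If you replace the usual (+, ×) with (min, +), one matrix-vector "multiplication" becomes exactly one relaxation step of single-source shortest paths.

```julia
using SparseArrays

# Matrix-vector "multiplication" over the (min, +) semiring: one such step
# relaxes every edge once, which is the core of algebraic SSSP (Bellman-Ford).
function minplus_step(W::SparseMatrixCSC{Float64, Int}, dist::Vector{Float64})
    out  = copy(dist)            # keep current distances (acts like 0-weight self-loops)
    rows = rowvals(W)
    vals = nonzeros(W)
    for v in 1:size(W, 2)        # column v holds the incoming edges u -> v
        for idx in nzrange(W, v)
            u = rows[idx]
            w = vals[idx]
            out[v] = min(out[v], dist[u] + w)   # "addition" is min, "multiplication" is +
        end
    end
    return out
end

# Tiny weighted digraph: 1 -> 2 (1.0), 2 -> 3 (2.0), 1 -> 3 (5.0).
W    = sparse([1, 2, 1], [2, 3, 3], [1.0, 2.0, 5.0], 3, 3)
dist = [0.0, Inf, Inf]           # source is node 1
dist = minplus_step(W, dist)     # after one step:  [0.0, 1.0, 5.0]
dist = minplus_step(W, dist)     # after two steps: [0.0, 1.0, 3.0]
```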
The trouble with SuiteSparse:GraphBLAS is that even though it's very well designed and has deep theoretical roots, it has a rather steep learning curve. So as of today, I cannot show you any benchmark results for it that are reasonably well implemented and reasonably well benchmarked; we will publish those once we have the results correct and have done a few runs.

Even though I can't show you benchmark results for GraphBLAS, I can show you the results on the Paradise Papers. We calculated multiple typed clustering coefficient metrics; these are TCC1 and TCC2, which look for different wedges and different triangles. It shows that there are some outliers: there are some nodes in this offshore leaks dataset that are very clustered according to some types. This is something that is worth investigating. The trouble is that we don't have the domain-specific expertise, so if you do, please talk to us and reach out, because it would be really cool to investigate these outliers further.

And I said there isn't just a systems problem here; there is also an algorithmic, or theoretical, problem when doing graph analytics, and one of these problems is skew. This was recognised in relational databases about five years ago, in a SIGMOD Record paper from 2013 which showed that if you use binary joins for evaluating certain queries, those queries are inevitably suboptimal. This is true for cyclic queries, and unfortunately, if you're looking for a triangle, that's a cyclic query. Here is an interesting example. We have three groups of nodes, with one designated node in each group. The groups don't have any internal edges, but the designated node is connected to all the nodes in the other groups, and the other nodes, the black ones, are only connected to the designated nodes. So this is a weird graph, and it turns out that if you try to evaluate R join S join T, the join of these three edge relations, in any order, it will always be suboptimal and always require quadratic time. The reason is that you have to enumerate all of these wedges that we talked about: you enumerate a set of wedges here that will not close into a triangle, another set of wedges there, and a third set there, and none of them closes into a triangle. So this turns out to be a pretty deep computer science problem.

Initially, I thought that maybe matrices can resolve this problem without much thinking, but that turns out not to be the case. If we map these relations to matrices and do a matrix multiplication, we get a very dense sub-matrix: if we compute R times S, which is the same as a binary join, we get a quadratic number of wedges and it will obviously take quadratic time. If we then apply T, so if we continue with the multiplications, this will eventually cancel out the invalid wedges that do not close into a triangle, but by then it is already too late: we have already enumerated a lot of wedges and burned the CPU cycles. One optimisation that can be used here is to use T, the green matrix, as a mask for the multiplication. And it turns out that database researchers have found a solution to this problem as well: the family of worst-case optimal join algorithms. These can calculate such multi-way joins, ternary or even higher-order joins, in O(N^1.5) for the triangle query. So there is hope and there are algorithms, but they are very difficult to adapt to practical applications.
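To show what a mask buys you, here is an illustrative, hand-written masked sparse multiplication in Julia. Plain Julia has no built-in masked product, this is not GraphBLAS code, the function name is made up, and it is not tuned for performance. It computes the entries of R*S only at positions where the mask T is nonzero, so the dense intermediate R*S is never materialised.

```julia
using SparseArrays

function masked_mul(R::SparseMatrixCSC{Int, Int}, S::SparseMatrixCSC{Int, Int},
                    T::SparseMatrixCSC{Int, Int})
    n = size(T, 2)
    Is, Js, Vs = Int[], Int[], Int[]
    Trows = rowvals(T)
    Srows = rowvals(S)
    Svals = nonzeros(S)
    for j in 1:n
        for tidx in nzrange(T, j)              # only the positions allowed by the mask
            i = Trows[tidx]
            acc = 0
            for sidx in nzrange(S, j)          # dot product of row i of R and column j of S
                k = Srows[sidx]
                acc += R[i, k] * Svals[sidx]
            end
            if acc != 0
                push!(Is, i); push!(Js, j); push!(Vs, acc)
            end
        end
    end
    return sparse(Is, Js, Vs, size(R, 1), n)
end
```

For the triangle example shown earlier, something like `masked_mul(A, A, A) * ones(Int, n)` would give the same per-node closed-wedge counts as `((A * A) .* A) * ones(Int, n)`, but without ever building the dense product of A with itself.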
Of course, you might be wondering whether this occurs in practice. Well, not to the extreme of the example, but due to the power-law degree distribution of scale-free real-world networks, it's actually more common than one would imagine. GraphBLAS offers masked multiplication operators, but I don't think many other libraries do; even in otherwise pleasant languages like Julia, there is just no built-in way to do masked multiplication, so you have to write your own custom matrix multiplication operation.

Okay, so a couple of benchmark results for regular graph analytics. This is LDBC's Graphalytics benchmark, which works on a synthetic social network graph similar to the distributions Facebook has published. It's sort of a large mid-size graph, four million nodes, and it's pretty dense. The interesting thing found there is that LCC, the local clustering coefficient, is one of the more difficult graph metrics. This comes from a paper that was published before I joined the Graphalytics working group, and in it they observed that LCC is actually two or three orders of magnitude slower than metrics such as PageRank or single-source shortest path, which is really interesting, because those are the ones you would expect to be more difficult. And if you look at systems that have a higher baseline, Giraph and GraphX, which are already pretty slow for PageRank and single-source shortest path, they just fail to finish and time out.

Okay, so I'll wrap up soon. A couple of takeaways: graph processing is hard. You need to use sparse matrices, you need to use edge types, it's desirable to be able to override the multiplication and addition operators, of course we would like to run things in parallel, and we would like them not to crash on skewed distributions. I don't think there are off-the-shelf solutions, but I've gathered a couple of building blocks that you could use for C, C++, Java, and Julia. And it turns out that for Python there is already a wrapper for GraphBLAS. So this is all food for thought and tools for experimentation. There are a number of papers; it turns out that last year people started to build distributed, linear algebra-based graph analytical frameworks, which I think is a good step towards scalable graph analytics. Basically, based on our experiments: if you use Java, just use EJML; if you have the time to learn GraphBLAS, that's a very good investment of your time; and Julia is promising, but has no specific graph support yet.

One confession at the end: I believe that some Java libraries could actually have worked for the Paradise Papers, but we got too carried away with the linear algebraic wizardry. I think the Arabesque library, Flink's Gelly, and GPS, the graph processing system, could handle a dataset at the scale of the Paradise Papers. As you can see, none of those were included in this benchmark, so they could probably work for something like the Paradise Papers dataset, and it would be very interesting to see how they perform in practice. Thank you for your attention, and if you have any questions, we have about four minutes for them. Yes? Okay, go ahead.

Thank you very much for the talk. I'll just be provocative because it's easier, sorry for that. I'm an engineer, and for me, when you say sparse matrices, what I think is behind them is just dealing with a list of nodes and a list of edges.
If you do math, which I don't do so much, your formalism in terms of matrices is fine. But actually thinking that matrices are a good idea for computation seems kind of weird to me: it takes very seriously a theoretical practice, which is not how the computer actually works. And considering that the most efficient algorithms I know of for counting triangles are expressed in terms of nodes and edges, not in terms of matrices, what is the motivation for doing it this way?

Well, the question was whether matrices are a good idea, given that there are specialised algorithms for problems such as triangle counting. It's a bit like the common idea of an intermediate language in computer science: these building blocks are better optimised than what someone would write from scratch in a short time. Take our use case: if you only have a couple of days to implement the metric and you don't want to do detailed algorithm design, you can just use those building blocks and invoke a couple of GraphBLAS API calls, or a few matrix multiplications in, say, Julia. I think this strikes a good balance between expressivity and performance. And I think it's a good direction because the hardware landscape is getting so heterogeneous: even if you have a very good edge list implementation, you will not be able to deploy it to a GPU or an FPGA or some other specialised piece of hardware. Okay, one more question?

In our graph library, we first consider the interface, what a graph is for most people, separately from the implementation of graphs and the operations themselves; the implementation is then up to the users, or we provide some basic ones. If you always use GraphBLAS, for example, you're always going to have the same sparse matrix structure. So have you planned to make an interface that would be common to both sparse and dense matrix implementations?

I have a quote from Sam Halliday, who implemented one of the sparse and dense matrix libraries. Sam said that there is actually no such thing as a sparse matrix, because the variety of matrices you find in the wild is so big, there are so many different types, that it always ends up being a custom solution for a given sparse matrix. So I have a quote to say that you are right: whether you have a power-law graph or a mesh graph, you will need different algorithms if you want to extract the last ounce of performance from your hardware. I think this is an open problem. Probably what we should do is create various sparse matrix representations and optimise the various operators according to the distributions, and that's probably closer to the optimum than the current approach. Okay, thank you again.