The cameras are filming you. You're a wonderful guy. Okay. Welcome back, everyone. I'm really happy to welcome Frank here on Graph.dev. He's from Pivotal, and we'll hear about graph analytics on scalable databases. So welcome, Frank, and enjoy the talk.

Thank you very much. It's a real pleasure to be here, and thank you everybody for coming at the end of a long day. Okay, can you hear me in the back? Yeah, okay. I'll try to speak up. Right, so thank you very much. It's really my pleasure to talk to you about graph analytics on massively parallel processing databases. This talk is in the context of Apache MADlib, which is an open source project in the Apache Software Foundation, but I hope that some of the ideas are more general, around running graph algorithms on parallel databases. I've been working primarily in the area of machine learning on parallel databases for a while. Graph is something that's relatively new for us, at least on products like Greenplum Database and Apache HAWQ, which are both open source MPP databases. Apache HAWQ is similar to Greenplum; it's an MPP SQL-on-Hadoop project. So graph analytics is becoming a very important part of the enterprise. There has been a lot of academic work, but in the last 10 to 15 years it's seen a huge surge of interest in the enterprise. And as it started to become more popular in the enterprise, folks have looked at using specialized graph engines like Neo4j. We've heard about some of them today, and they are great products. The thing is, for a lot of the companies that we work with, SQL is their primary workload. They have expertise in SQL, they have investments in SQL, and that's where their data is. So they started asking us: what can you do on these databases with respect to graph? Because graph is such a common representation for real-world problems. That's the focus of the talk today.
I thought I would start with a slide from DB-Engines that looks at database engine popularity. You can see the usual suspects here in the top 10, and the first graph database, Neo4j, is around 21st position. There are about four in the top 200. That's not to say that these are not used, or are not really great products, but when it comes to the enterprise, they're less present. With respect to the trends of these databases over the last four years: this is Neo4j at the top, this is Titan, this is Giraph, and this is GraphDB. So they're certainly growing. This is a log scale, so they look flatter than they are; they're actually growing significantly, although maybe they've flattened a little bit in the last year or so.

So graphs, as you know, represent entities and their relationships. They can help us understand connections. They can be very small and simple, like this group of friends here. However, most real-world graph problems are actually very large, and we've heard about some of those today in the previous talks, which is very interesting. This is an example of a visualization from LinkedIn showing the connections for a particular person. These kinds of graphs exist in computer networks, social networks, the internet, transportation networks. So the question is: can we have a computing platform that can reason over these large graphs efficiently and do analytical operations at scale? Now, the idea of using massively parallel processing databases for graphs: does that make sense? Should we even go down this avenue? Here are a couple of points that may lead us to believe it's not a bad idea to investigate. First of all, these databases are really built for scale; they've been devised from the ground up for very large data sets. Greenplum Database has been around for 10-plus years.
And it has thousands of person-years of investment in designing for and handling these large data sets. One of the things that I've observed in speaking with our customers and with the data science folks that I work with is that many enterprise use cases combine graph analytics with other kinds of techniques. There's the context-specific part from graph, but also non-graph methods, and together they give the kinds of practical end-to-end solutions to problems. So that's another consideration. I mentioned SQL earlier: SQL is the most common workload in the enterprise. It's used very widely. Data scientists know it, along with Python and R; analysts know it for sure. And there's a whole ecosystem of business intelligence tools associated with SQL. Many enterprises have invested heavily in that; they're running Tableau for visualization, and many other BI products. Another issue is data locality. There's a cost to replicating, moving, and transforming data to external systems, and when we're talking about very large graphs and very large sets of data, that cost of moving really is a consideration. And the other thing, which is particularly important in business, is policy. There's a cost to procuring separate database engines, running them in production, and doing all of the oversight and training that they require. What it means is that you need to convince the chief information officer that if you really want to use Neo4j, or Titan, or another graph database, it needs to get put into production, and there's a whole overhead associated with that. And especially the customers that we work with in financial services, government, and health care have very stringent requirements around this. So we said it'd be a good idea to do this. But can you actually efficiently perform graph analytic processing on relational data in an MPP database?
So here I put the answer as yes. It's not a blanket yes, because I think there's a class of graph algorithms that works effectively in MPP land, and there are others that don't; I'll talk about that shortly. But with Apache MADlib running on Apache HAWQ or Greenplum Database, we've been able to solve some pretty interesting use cases for our customers, and I'll shortly give an example on cybersecurity that we've worked on recently.

I want to say a word about Apache MADlib. It's an open source machine learning library that runs in-database, and what that means is that it brings the compute to the data: you don't need to move the data out, operate on it, and then move it back into the database. The main value of Apache MADlib is scalability and performance. If you have smaller data that fits in memory on a single node, then there are a lot of other options for machine learning, and for graph as well: you can use R, you can use a lot of the Python libraries, and there are commercial products in this area like SAS. MADlib is really focused on scale, and performance at scale. It came out of work that was done initially at UC Berkeley in 2011. A very smart gentleman by the name of Professor Joe Hellerstein worked on this, along with folks from what was then EMC, EMC Greenplum. Since that time it's been a collaboration: Stanford University, the University of Wisconsin-Madison, and the University of Florida have also contributed. In fact, some of the academic work on running graph algorithms on relational databases was done at the University of Wisconsin-Madison; you can look up papers there by Professor Jignesh Patel. MADlib's been around for about five years or so. It's a pretty expressive library; it's got 35 to 40 functions.
And you can see here that the release coming out in a couple of weeks has the first graph implementation, single source shortest path. We've prototyped a bunch of other algorithms, which are going to go in in a subsequent release. What we're interested in is adding a certain class of graph algorithms to MADlib. The way MADlib works is that it's SQL-based, so it's a declarative function call within a SELECT statement. This is just an example of training a linear regression model: you give it an input table and an output table representing your model. Here you want to predict house prices given property tax, number of bathrooms, and size of the property, and you can group the models, say by number of bedrooms. There's this idea in machine learning of training and then prediction: you train, you build this model, and then when new data comes in that you want to predict on, there's an associated prediction statement for that. So this is the idea behind the way MADlib is set up with respect to SQL.

How does it work in the database? Greenplum Database is a shared-nothing parallel system. SQL comes in to a master, where a query optimizer figures out how to parse that SQL and distribute it to the worker nodes. The worker nodes are called segments. The segments are where all the computation happens, where the query processing happens, and where the data resides; when you load data, you load it into the worker nodes. And this is not just Greenplum: other MPP databases out there look something like this. The way MADlib fits in is that you can think of it as a layer on top of Greenplum Database. It accepts SQL, R, as well as Python. It does input validation and preprocessing, and then it actually runs the machine learning algorithms in-database as stored objects. So you can think of it as a layer on top of Greenplum, but it is also compiled with the database.
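The train-then-predict split described above can be sketched in plain Python. This is a toy stand-in, not MADlib's actual API; the feature columns (tax, bathrooms, size) come from the talk, but the data values and function names here are made up for illustration.

```python
import numpy as np

def train(X, y):
    # "Training" step: fit ordinary least squares and return the
    # coefficients (intercept first), analogous to the model table
    # that a MADlib training call would produce.
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict(coef, X):
    # "Prediction" step: apply the stored model to new rows.
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ coef

# Hypothetical training rows: [property tax, bathrooms, size] -> price
X = np.array([[3000.0, 2, 1200], [4500.0, 3, 2000], [2000.0, 1, 800]])
y = np.array([250_000.0, 400_000.0, 150_000.0])

model = train(X, y)
new_house = np.array([[3500.0, 2, 1500]])
print(predict(model, new_house))
```

The point of the separation is the same as in MADlib's SQL interface: the model is a stored artifact you build once and then apply repeatedly to incoming data.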
OK, so how do we do graph in this world? Here is our old friend the directed graph. We have vertices, or nodes, and we have edges, and the edges in this little example have weights. The edge weights can be positive; they could be negative as well. The way we represent this in MADlib is with a vertex table and an edge table, similar to what we just heard in the SAP presentation earlier. In the vertex table you have the vertex ID, and there could be a bunch of parameters associated with it. In the edge table we have a source vertex, a destination vertex, and an edge weight, and there could be a bunch of other edge parameters as well.

For single source shortest path, I'll walk through how it was built, some of the mechanics of how it was implemented. Just to remind you what this is: given a graph and a source vertex, you find the path to every other vertex in the graph such that the sum of the weights of the edges is minimized. Here there are multiple paths to get from A to F, but the shortest path is to go A, C, E, D, F, because that gives you the smallest aggregate edge weight. There's another path here through B, but that's not included because it would have been a longer path. Although this is a really simple algorithm, it's actually very useful. It can be used for many things: vehicle routing and navigation, degrees of separation in a social network, minimum path delay in telecommunications networks, plant facility layout, VLSI design. So many, many different kinds of applications of what is actually a pretty simple algorithm. The implementation that we did in MADlib uses something called the Bellman-Ford algorithm, and this is a performance curve for it. Bellman-Ford supports negative edge weights, but not negative cycles, and it runs in the order of the number of vertices times the number of edges in the worst case.
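To make the representation and the algorithm concrete, here is a small Bellman-Ford sketch over an edge table of (source, destination, weight) rows, mirroring the vertex/edge tables described above. The A-to-F graph matches the slide's example, but the specific edge weights are assumed, since the talk doesn't give them.

```python
INF = float("inf")

def bellman_ford(vertices, edges, source):
    # dist holds the best-known total weight from source to each vertex;
    # pred records the previous hop so we can reconstruct paths.
    dist = {v: INF for v in vertices}
    pred = {v: None for v in vertices}
    dist[source] = 0
    # Relax every edge up to |V|-1 times: O(V * E) in the worst case,
    # as mentioned in the talk. Negative weights are fine; negative
    # cycles are not.
    for _ in range(len(vertices) - 1):
        changed = False
        for src, dst, w in edges:
            if dist[src] + w < dist[dst]:
                dist[dst] = dist[src] + w
                pred[dst] = src
                changed = True
        if not changed:  # early exit once no distance improves
            break
    return dist, pred

def path_to(pred, target):
    # Walk predecessor links back to the source.
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]

vertices = ["A", "B", "C", "D", "E", "F"]
# Hypothetical weights: the A-C-E-D-F route beats the A-B-F route.
edges = [("A", "B", 4), ("A", "C", 1), ("C", "E", 1),
         ("E", "D", 1), ("D", "F", 1), ("B", "F", 5)]

dist, pred = bellman_ford(vertices, edges, "A")
print(dist["F"], path_to(pred, "F"))
```

In the database, each relaxation pass corresponds to a set-based update over the whole edge table rather than this per-row Python loop, but the logic is the same.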
Normally it's much less than that, but that's the worst-case configuration of a graph. This is performance testing that we ran. First of all, it's running on a fairly small cluster: a single master and four worker nodes, and those are multi-core nodes, so there are multiple segment instances running on them. You could say that it's a four-node cluster, but there are actually 24 segment instances. This axis is the number of vertices, going up to 500,000, and this is the runtime, up to about 45 seconds. This was with a fixed 100 edges per vertex, running on a relatively recent release, Greenplum 4.3.10. So the last point corresponds to 50 million edges. It's not a huge graph, but fairly significant. And it's showing a linear or sublinear runtime increase with the number of vertices. In the worst case this would be second order, but because we fixed 100 edges per vertex, it shows up as more linear. This looks kind of faded, and you probably can't see it from the back, but the way you call it is in a SELECT statement as well. There's a declarative, built-in function you call, graph_sssp. You give it a vertex table, the column where your vertex IDs are, and an edge table; you provide the source vertex, and it will generate the shortest paths. Then, if you want to retrieve a particular path, there's a function to retrieve the path. I wanted to talk a little bit about implementation considerations, which maybe go beyond just single source shortest path, and which go beyond Greenplum and MADlib. This is about graph on MPP in general.
So with respect to relationships, there are considerations around which algorithms to implement and how to go about implementing them. Relationships are not a first-class citizen in a relational database, at least from the graph perspective, unlike, for example, in Neo4j, where the graph structure looks very different from an edge and vertex table. It means that join operations are intensive: they're expensive in compute and memory, so you really want to minimize them. If you're doing joins across multiple relationships, then you need to think about whether the MPP database is really the right place to do that. Table scans are another consideration. We used Bellman-Ford for single source shortest path. If we had used Dijkstra, which is another shortest path algorithm that is more depth-first versus breadth-first, it would be much slower, so you want to reduce the number of table scans. Depth-first kinds of algorithms are expensive, so you want to favor breadth-first kinds of search. Likewise, greedy kinds of algorithms that don't take advantage of query optimization are going to be a problem. Query optimization is your friend, because a lot of time has been spent thinking about how to run distributed queries effectively on MPP architectures. If you specify that you want to scan a table looking for vertices or edges in a certain order, it's going to be much more expensive than letting the query optimizer do that for you. Then there are implementation considerations around database limits; databases have some unusual things about them. Both Greenplum and Apache HAWQ were forked from Postgres, so we inherit some of the limits that Postgres has. For example, the maximum size of a cell in Postgres is one gigabyte.
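The breadth-first versus depth-first point can be illustrated with a frontier-style shortest path loop. Each iteration below touches the whole edge table once, in bulk, which is the shape of work a distributed query optimizer handles well (one join between the frontier and the edge table per pass); a Dijkstra-style loop would instead pick one vertex at a time, forcing many tiny scans. This is a sketch of the idea, not MADlib's implementation, and it assumes no negative cycles.

```python
def frontier_sssp(edges, source):
    # edges: relational-style (src, dst, weight) rows.
    dist = {source: 0}
    frontier = {source}
    while frontier:
        # One set-based "join": all outgoing edges of the current
        # frontier, in a single pass over the edge table. In SQL this
        # would be one JOIN per iteration, left to the optimizer.
        candidates = [(dst, dist[src] + w)
                      for src, dst, w in edges if src in frontier]
        frontier = set()
        for dst, d in candidates:
            if d < dist.get(dst, float("inf")):
                dist[dst] = d
                frontier.add(dst)  # only improved vertices re-enter
    return dist

# Same hypothetical A-to-F graph as before (weights assumed).
edges = [("A", "B", 4), ("A", "C", 1), ("C", "E", 1),
         ("E", "D", 1), ("D", "F", 1), ("B", "F", 5)]
print(frontier_sssp(edges, "A"))
```

The number of iterations is bounded by the longest shortest path in hops, so a handful of bulk passes replaces per-vertex bookkeeping, which is exactly why breadth-first formulations map well onto MPP query execution.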
So if you're doing iterative algorithms in which you're storing parts or aspects of your graph, you need to think about these limitations. Before I get to our cybersecurity example, I wanted to say a word about the roadmap. These are the kinds of things we've actually prototyped; most of these are ready, and we'll be contributing them to Apache MADlib pretty soon. Next on the list is all pairs shortest path. If you remember, single source shortest path is: pick one source, find the shortest paths to everywhere else in the graph. All pairs is: what is the shortest path between all pairs of nodes in the graph? There is an algorithm, Floyd-Warshall, that runs in the order of the number of vertices cubed, which we're looking at. The reason this is interesting is that once you have all pairs shortest path, you can get some really interesting measures, like betweenness. Betweenness centrality is a way of finding the influencers in a network, because that's where most of the shortest paths run through. There are also things like graph diameter. Graph diameter is very interesting because it shows the reach within a network, which is really important in the cybersecurity example I'm going to show you. PageRank is a well-known one to identify the importance of vertices. Connected components, both strongly and weakly connected, are useful for clustering; these are pretty good candidates because they're breadth-first as well. Graph cut is an important one that we've been looking at. The reason that's useful is that sometimes your graph is just too massive, so you need to cut it in a sensible way and then operate on subgraphs. But that one is harder to do. I'd like to finish with a cybersecurity example that we've been working on, lateral movement detection, which hopefully can illustrate some of the points I've been making.
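The Floyd-Warshall algorithm mentioned above, and the graph diameter that falls out of all pairs shortest path, can be sketched like this. The tiny graph and its weights are made up for illustration; the diameter here is simply the largest finite shortest-path distance over all ordered pairs.

```python
INF = float("inf")

def floyd_warshall(vertices, edges):
    # Start with direct edge weights; INF where no edge exists.
    dist = {u: {v: (0 if u == v else INF) for v in vertices}
            for u in vertices}
    for src, dst, w in edges:
        dist[src][dst] = min(dist[src][dst], w)
    # O(V^3): for each candidate intermediate vertex k, check whether
    # routing i -> k -> j beats the best known i -> j distance.
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def diameter(dist):
    # Largest finite shortest-path distance: the "reach" of the graph.
    return max(d for row in dist.values() for d in row.values()
               if d < INF)

# Hypothetical directed graph: the direct A->D edge is beaten by the
# three-hop route through B and C.
vertices = ["A", "B", "C", "D"]
edges = [("A", "B", 1), ("B", "C", 1), ("C", "D", 1), ("A", "D", 5)]

apsp = floyd_warshall(vertices, edges)
print(apsp["A"]["D"], diameter(apsp))
```

Once the all-pairs distance matrix exists, measures like diameter are cheap follow-on aggregations, which is why all pairs shortest path unlocks the other metrics the talk lists.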
So the idea behind this example is that perimeter defenses are inadequate, which is to say you can't base your network security on keeping people out. You have to assume that people will get into your network, and you want to identify them in that case, right? Threats can come from within. Here is the path in which this is known to happen; it's called the advanced persistent threat, and this is the chain. There's a phishing attack or a zero-day attack in which, say, a user's machine is compromised. Then a third-party machine comes in the back door via this compromised machine, and then there is lateral movement. Once the bad actor is in the network, they may expand their footprint, expand their graph diameter, to elevate privileges, look at financial data that may be available, hunt around, then gather data from these servers and exfiltrate it. That's a common path. So I want to talk about this lateral movement idea. Lateral movement detection is a data science problem; in fact, a lot of cybersecurity problems now are data science problems, of which graph plays a part. You want to look at users and user behavior models, and you also want to look at networks and sensors: which users are touching which machines. The way lateral movement detection works is, as I described before, a multimodal problem. You have logs coming in from Active Directory and other sources, plus server information, and it lands in a data store somewhere. This could be an object store, it could be Greenplum Database or external tables somewhere, it could be a Hadoop file system. Then within Greenplum, there are actually multiple models that are run. I'm going to show you a couple of those, but I've listed them out here. The data scientists that we work with run regression models, clustering models, recommendation models, user behavior models, and graph models.
And what they do is produce a net score per user on a daily or weekly basis, and that net score is an aggregation of all of the activity from that particular user. In this example, there were 10,000 users and 60,000 servers. So not huge, but pretty big. Here's an example of a regression model. I know you can't see this at the back, but this axis is week, and this is the number of servers touched. There's a big spike that happens here, so if you run a simple linear regression and look at changes in slope over time, that's one indicator of a change of behavior. A user behavior model is another one. The user behavior models in this particular case use a machine learning algorithm called principal component analysis. In the heat map you can see here, this axis is week as well, and this is server; it could be server group. The heat map is showing that this user is using server one quite a lot, and they're using server nine quite a lot. But then they hit a point here at which they're using many servers; they begin logging into a lot of servers. So this machine learning model, which is building up a behavioral profile, will score that accordingly. I won't go over the others, but then you have a graph model as well. The graph model is built on a weekly basis per user, using historical windows to build the graph. It looks at which machines a user logs into, which machines they log in from, how often, et cetera. In the typical behavior, they log in from, say, their home machine to this machine, and they touch these servers. In the anomalous behavior example, they log in from this machine, but they also go directly to other machines. They go to this machine; they go to a machine that they've never been on before, which has a database with financial information. It's a different kind of footprint. And the reason this is really useful for lateral movement is that graphs are sensitive to direction.
So they have direction; order is important, and frequency is important as well, because as frequency increases, the edge weight goes up, and that expands the graph diameter. The kinds of graph measures that are important to look at here are things like centrality measures, like a betweenness centrality that increases unexpectedly, and also how graph diameter changes over time. That's a really important thing to watch per user, to see if there's any kind of basic behavior change. Then all of these models can be combined and given a score per user.

So, for Apache MADlib, we're on our fourth release, and we're moving towards top-level status. Anybody is welcome to join us on this project. The thought I want to leave you with is that MPP databases can be effective for solving a certain class of graph analytics problems at scale. And thank you for your attention.

Two quick questions. So thank you very much. From the example you showed, is calling the function going to materialize the data first, or can you push down these operations into the nodes?

Yeah, it pushes it down. I didn't go into the detail of it, but the way the flow works is, if you're on a client, the SQL is sent to the database server, and then the compute actually happens via stored procedures on the worker nodes. So yeah, the heavy lifting is done where the data is, on the worker nodes.