Hi. So, initially, another colleague of ours, Paul Horn, was planning to give this talk, but he's unfortunately sick, so I'm here instead. My name is Max. What you should know is that we all work for Neo4j, specifically in the Graph Analytics team. And today we want to present a new, or not so new, project that we are working on, and that is the Neo4j Graph Data Science Library. Some of you might already have used the Graph Algorithms Library; basically, this is its successor. I'll go into the details about the history a bit later. For those of you who haven't used it so far: what is the Neo4j Graph Data Science Library? It's basically an add-on for the Neo4j database, which is a graph database, for those of you who don't know. It's an add-on that allows you to run graph algorithms. The Neo4j database itself is more of an OLTP database, so it's mainly used for short queries that touch only smaller parts of a graph, mostly using the Cypher query language. Within Cypher there is support for some algorithms, like shortest path, for example. But the Neo4j Graph Data Science Library gives you much broader access to global graph algorithms, and we have implemented a broad set of them. We have algorithms for community detection and clustering, for example label propagation. We have algorithms for similarity calculation; for example, you can calculate how similar two nodes are given their neighborhoods. For that we use node similarity based on Jaccard similarity. We have centrality algorithms such as PageRank. We have some link prediction and pathfinding algorithms; for pathfinding, we have Dijkstra and A*. And then we have some auxiliary functionality, like graph creation for the in-memory graph. Another thing the library provides is an API for implementing your own algorithms. There are certain ways you can do this.
The easiest way is using the Pregel framework, which was introduced by Google quite a while ago, and we support that as well. It allows you to easily write your own algorithms and then make them available in Neo4j through Cypher, via the Procedure API. So, when would you use the Graph Data Science Library, or GDS as we call it, because "graph data science" is pretty long? Oh, no, that comes later. First, we start with some timeline information. As I already said, this project initially started as the Neo4j Graph Algorithms Library, a project spearheaded by the Neo4j DevRel/Labs organization led by Michael. He initiated the project, and it was spiked by an external contractor from Avantgarde Labs: Paul, who was originally supposed to give this talk. He's been working on this project for quite a while, and as you can see, it was started as early as Q1 2017. They kept working on it for a while; the library was already open source and public, so you could use it by then. At some point, Neo4j product engineering actually took over. So we came into the picture to make it a fully fledged Neo4j product, and that work started in Q1 2019, so last year. Since then, we've mostly been productizing the library: doing some code refactoring and making sure that all the algorithms run in all the ways you would expect them to, and that they have the features users want them to have. That's not to say that Michael, Paul, and the other people at Avantgarde Labs didn't focus on that when they developed it, but we talked to potential customers and looked at what they actually wanted from the algorithms. We then took a smaller subset of the algorithms that already existed and made sure they run with the best performance we could achieve. And now we are at basically today, Q1 2020.
And we have basically re-open-sourced the project. While we were developing it as Neo4j product engineering, we had closed-sourced the project for a while, and we have now open-sourced it again; I'll show you the link later. We will make a first pre-release, I think, next week, and then in Q2 2020 it will be generally available through the usual channels, mostly Neo4j Desktop if you want to try it out. So now, when would you actually want to use it? We basically draw the distinction between, as I already mentioned, more OLTP-focused, online-transaction-style workloads, where you have small, short-lived queries that you run over and over again, just with different parameters. For example, if you want to drive your website, you would do more OLTP-style queries that only touch a very small subset of your graph, and this is where Neo4j with the Cypher language performs at its best. That's not to say Cypher and Neo4j don't also support more analytical workloads, but both the query language and, mostly, Neo4j itself perform at their best on OLTP: short-lived, graph-localized queries. And then we have the graph algorithms side, which is for when you want to analyze the entire graph. You don't really know exactly what you're looking for; you have an idea of what you want to analyze, or maybe a hypothesis, and you want to prove that hypothesis. Then you can use the graph algorithms library and discover new insights in your graph by doing graph-global aggregations. As I said, when you run more Cypher-style queries, you only use a very localized part of your graph, and you know what you're looking for: you know that you're looking for a person named such-and-such, and you want to know its neighborhood.
But when I'm using the analytics library, I'm using all of my graph, or a very big part of it, and I'm digging into the graph, finding new information and new insights, and then maybe feeding them back; that's why it's called the Graph Data Science Library: maybe feeding the results back into my data science workflow, which is now aided by graph processing. So how does our library work? How would you use it, for those who haven't used the Graph Algorithms Library? Currently we always start from a Neo4j database: you have your graph loaded into your Neo4j database, and it's living there. Here I'm using a graph where I use colors to denote nodes with different labels, for example. Inside the Graph Data Science Library we use an in-memory model of the graph to power those very fast graph algorithm runs, so we can't really rely on the store layout of Neo4j. What we do is load the graph from Neo4j into an in-memory model, and that's what you always start with. You load your graph, and you can project the graph: you can rename labels, you can skip certain labels (that is, skip nodes with a certain label), you can skip relationships, you can rename relationships, you can choose to load certain properties or not, and you can even reverse the direction of edges in your in-memory graph. I've visualized this here, and we've basically skipped the red nodes in this visualization. Now we have our subgraph in memory. This graph now lives in a thing we call the graph catalog, which is basically just what you would imagine: the graph has a name, and you can retrieve it by that name as long as you stay inside the Graph Data Science Library. So we've loaded this graph into memory, and now it's time to run an algorithm. Here, for example, we do some clustering in my example.
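As a rough sketch of working with the graph catalog described above (procedure names as in the GDS 1.x pre-releases; the graph name, node label, and relationship type here are just example values, not from the talk):

```cypher
// Load a named graph into the in-memory graph catalog
CALL gds.graph.create('my-graph', 'Person', 'INTERACTS');

// List the graphs currently held in the catalog
CALL gds.graph.list()
YIELD graphName, nodeCount, relationshipCount;

// Drop the graph again to free the memory it occupies
CALL gds.graph.drop('my-graph');
```

The catalog is per-database-instance and in-memory only, so dropped or restarted graphs have to be re-created from the Neo4j store.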
So you specify: I want to use this in-memory graph, and I want to run, for example, a clustering algorithm. What you get back is either another in-memory graph or some information about that graph that says: these nodes belong to one cluster, and those nodes belong to another. As a next step, you would consume this result. You could run another algorithm on that graph, maybe using the information from the first run, or you consume the information the algorithm returned to you, and you can do that in two ways. You can write the results back into Neo4j; this mostly happens by writing a node property. For example, after the clustering we would write a property on every node saying: this node belongs to the cluster with ID one. For some algorithms, like the similarity algorithms, we would instead write a relationship back to the graph. The other option is streaming results back: if you don't want to persist the result of the algorithm, you can stream the results back inside your Cypher query and then use them in the driver or your application as you see fit. These are the two ways we currently support, and the most important thing is that the workflow currently always starts at Neo4j; then you load a graph (you can load as many graphs as you like, or as many as fit into your memory), use them for running algorithms, and then consume the results. So what kinds of algorithms do we have? I already hinted at them: we have community detection, to find clusters of nodes that somehow belong together; centrality, or the importance of certain nodes, where PageRank is probably the most well-known, or betweenness centrality; similarity algorithms; link prediction; and pathfinding and search.
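The two consumption modes just described might look roughly like this, using label propagation as the clustering example (the graph name and the `communityId` property name are assumptions for illustration, syntax as in GDS 1.x):

```cypher
// Stream mode: results come back as query rows, nothing is persisted
CALL gds.labelPropagation.stream('my-graph')
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS name, communityId;

// Write mode: results are persisted back to Neo4j as a node property
CALL gds.labelPropagation.write('my-graph', { writeProperty: 'communityId' })
YIELD communityCount, nodePropertiesWritten;
```

Streaming suits ad-hoc exploration from a driver; writing suits results you want to query later with plain Cypher or feed into a second algorithm run.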
Those marked in bold font here are the ones we have actually productized. By productizing we mean that we guarantee some level of maturity of the algorithm, backed by our team and by Neo4j: if you use those parts, we hope they are bug-free, run as fast as they can, all these things. And I think it's time for a demo now. So... where is my mouse? I have this Neo4j Browser here, which is huge, I'm sorry for that, but I can't really change it. Can I zoom out? No. Okay, what I've loaded into this graph is the Game of Thrones dataset that we use for many examples. It's basically an extract of how people in the Game of Thrones books appear together: when people appear together on a page, we create nodes for those people and link them with an edge. So we have "interacted with" edges, meaning people interacted with one another in a certain book, and those interactions are also split up across the different books. There is some more information that we're not going to use in this demo. The most important part is: we have persons, and they interact with one another. This graph is loaded into this Neo4j database. We can run a simple query to get an idea of what this graph looks like; it just counts the average number of interactions between people. On average, a person interacts with 6.7 other people in the books, and the person who interacts with the most people has 170 interactions. That's just to get an idea. I think it's about 5,000 nodes and slightly more edges in the graph. Now we can dive into the Graph Data Science Library. As I said, the first thing we always do is load a graph. That's horrible. What we do is use Cypher procedure calls, because, as I said, we are a plugin, so you can use the Graph Data Science Library via procedure calls.
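A sketch of that exploratory query (the label `Person` and relationship type `INTERACTS` are assumptions about how the dataset is modeled, not taken from the slide):

```cypher
// Average and maximum number of interaction partners per person
MATCH (p:Person)-[:INTERACTS]-(other:Person)
WITH p, count(DISTINCT other) AS partners
RETURN avg(partners) AS avgInteractions,
       max(partners) AS maxInteractions;
```

This is a plain Cypher aggregation against the stored graph; it needs no GDS procedures at all, which is exactly the kind of graph-local work Neo4j handles well on its own.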
The first procedure call we do is gds, which every one of our commands is prefixed with, then graph, which covers all the catalog commands, and then create, because we want to create a new graph. Additionally, we say we want to call this graph got-interactions, because from the GOT, or Game of Thrones, graph we load all the interactions. Then we specify a node projection, which describes what kind of nodes we want to load. We use a very simple form here; you can specify many more things if you like, but here we just want to load every person in the graph, so every node with the label Person. For relationships, we want to load all the interaction relationships, and then we have a thing we call the projection. That's because we need to know in which direction to load the relationships, because of how we store them internally. Do we want them in their natural direction, that is, as they are in the database, from one node to another (in Neo4j, every relationship is directed)? Do we want to load them in reverse, in which case we would say reverse? Or do we want them undirected, in which case we basically store them twice, once in each direction? That's what we do here, so if one person interacts with the other, the other person also interacts with the one. So we load this graph, run the query, and get some information back about what we just loaded. We named it got-interactions; here you can see that we specified a very simple form of a node projection, and here is the extended form. It gives us some information: we have 2,000 nodes now, 7,000 interactions, or relationships, and it took 67 milliseconds to load. Right. So now it's time to run a first algorithm. We're going to use PageRank for that. Where am I? I guess most of you know it; I can't find my way back. No? There it is.
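Reconstructed from the description, the call would look roughly like this (a sketch only: the exact key for the relationship direction varied between GDS pre-releases, e.g. `projection` vs. `orientation`):

```cypher
CALL gds.graph.create(
  'got-interactions',            // name of the in-memory graph in the catalog
  'Person',                      // node projection: every node labelled Person
  {
    INTERACTS: {
      type: 'INTERACTS',
      orientation: 'UNDIRECTED'  // store each relationship in both directions
    }
  }
)
YIELD graphName, nodeCount, relationshipCount, createMillis;
```

The YIELD columns correspond to the summary the Browser showed in the demo: the graph name, roughly 2,000 nodes, 7,000 relationships, and the load time in milliseconds.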
So we are running the PageRank algorithm now. PageRank, as I said, is a centrality algorithm, so it gives every node a score that says how important it is to the network. Here we can again demonstrate what our algorithm syntax looks like: we have gds, our general prefix, then the algorithm we want to run, and then the mode we want to run it in. As I said, there are two different ways of consuming algorithms: either by writing or by streaming. Here we stream results back to the user. We say we want to run this algorithm on this graph in our catalog, and that's basically it, because we don't want to specify anything more detailed about the algorithm; we just run with the default settings. When we run it, we almost immediately get some information back. In this case, we see that Jon Snow is apparently the most important node in our network, with a PageRank score of 17-point-something, followed by Tyrion Lannister and Cersei Lannister. For those of you who've read the books, that kind of makes sense, at least to me: they are the most important characters in the book series. So at least it's returning what we would have expected. I wanted to show a more detailed example, but we're running out of time... no, that doesn't work, because I don't think we have much more time for more examples. So let's switch back to the demonstration. To the presentation, I mean.
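The streaming call just described, as a sketch (the `name` property on Person nodes is an assumption about the dataset):

```cypher
CALL gds.pageRank.stream('got-interactions')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
LIMIT 10;
```

`gds.util.asNode` resolves the internal node id returned by the algorithm back to the Neo4j node, so the result can show character names rather than opaque ids.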
Again, just a short wrap-up of what the algorithm syntax looks like. We always have a CALL, because in Cypher we want to run the procedure; then we have the algorithm name and the mode, then the graph name, and then the configuration. You can specify many more things for the algorithm: for PageRank, for example, you can specify the damping factor, or whatever configuration the algorithm has, and when you write back, you can specify into which property you want to write the PageRank value. We usually have three different modes: write, stream, and a stats mode, which basically just returns the metadata about an algorithm run — essentially like write, but without the writing. It gives you information about how long the algorithm took to compute, and maybe some information about the distribution of the result. Here's another example with weakly connected components, where you can see that you can specify many more properties, or configuration options, for your algorithm run. And with that, I want to thank you for listening. As I said, we open-sourced the repository just last week, I think; you can find it under neo-technology/graph-analytics on GitHub. Currently it doesn't build — it looks green, but we haven't released all of our build configuration yet, so if you have a problem there, we are working on it; soon you will also be able to build it yourself, but at least you can already have a look at the code, and hopefully at some point start hacking on it and trying it out. Thank you. Questions? We have time for maybe two of your questions. "Is the Data Science Library accessible from, like, Python or other languages?" So the question is whether it's accessible from Python or other languages. It is, kind of, through Cypher: you can use the Python Cypher driver, and then what you are writing is basically just Cypher queries; you can prepend or append Cypher statements around your algorithm call as you see fit. There's currently no way of writing your own algorithms in Python; we've been
thinking about that, and we've experimented with it a bit, but we don't have anything working in that direction currently. But there's a wrapper for it — there will be more on that, in depth, next Tuesday in Amsterdam. I don't know... it's actually neo4j/graph-data-science if you want to go to the public version of it; with the private one, you will get a 404. And if you didn't get the chance to get the book, it's also available online, free for download. The book that was around here: just go to neo4j.com and you can download a copy of the book, which explains all the algorithms that are part of the library. And with the pre-release that Max mentioned for next week, we will also do a docs pre-release that explains all the API that he just sketched, in, of course, very much more detail. So yeah, thanks again, Max.