This is the last talk at this conference and I'm happy that you still show up. This is about graphs. Don't worry, I will say in a minute what I mean by graphs. But before I start, let me say a few words about ArangoDB. You might have seen my earlier talk today where I was focusing on the resilience of the ArangoDB agency. But ArangoDB is actually more. ArangoDB is a multi-model database in the sense that it supports multiple data models at the same time, and this aspect of being a graph database is just one of the data models. So you can use ArangoDB simply as a scalable key-value store. Or you can use ArangoDB as a document store where you can store JSON documents, and again, it's highly scalable, but you can also use secondary indexes, do geo queries, do joins, and run transactions. And you can use it as a graph database. What's more, the fact that it's a multi-model database means that you can not only use it in very different modes, but you can mix everything. Just to give you an example, it's absolutely possible within a single query to use some secondary indexes to get some documents, then use them as the starting vertices in the graph and, for each of them, run a graph traversal, as we will see in a second, and then take the results, run a join operation with yet another collection, and put out the results in whatever form you want. So that is what I mean by multi-model. But today I just want to focus on the graph database aspect. Before we go into details, let me say a few words about myself. I work for ArangoDB and I'm responsible for development and architecture, but obviously I also go out, give talks, and engage with the community. Within development, I'm responsible for the distributed version of ArangoDB, and I recently helped with the so-called SmartGraphs feature, which I will explain towards the end of the talk.
So, I'm a distributed systems engineer, but originally I'm a mathematician and I've done a lot of computer algebra. Right, graphs. When I talk about graphs here, I don't mean the graph of a function plot, but a graph consisting of vertices and edges. The vertices are depicted here by the boxes; they are just some abstract dots or vertices. And they are linked by the edges, so the edges in this graph would be the arrows pointing from one vertex to another. The abstract notion of a mathematical graph is just that: vertices and edges. In that sense, it's just a relation, because you relate the different vertices to each other by means of the edges. So, for example, the upper left corner is related to the upper right corner because there is an edge, an arrow, going from one vertex to the other. Usually, when you work with graphs, you have even more, because you have some kind of labels next to the vertices and sometimes labels next to the edges. One usually speaks about a labeled graph in this case, or about a property graph, and that is what we deal with in ArangoDB. After all, ArangoDB is a database where you want to store data, and if you use it as a graph database, you can not only store the structure of a graph, you can also store labels or JSON documents with the vertices and with the edges. And that is actually how it fits into the multi-model nature of ArangoDB. You store a graph in ArangoDB by simply having a vertex document for each vertex in your graph and an edge document for each edge in your graph, and the rule is that every edge has to say from which vertex to which vertex it goes. What makes ArangoDB a graph database is that you can do graph queries; I'm coming to what I mean by graph queries in the next few minutes. So, obviously, the interesting part of a graph is poking around in it.
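The storage model just described can be sketched in a few lines of Python. This is a minimal illustration, not ArangoDB's actual implementation; the `_from`/`_to` attribute names follow ArangoDB's edge convention, everything else here is made up:

```python
# Every vertex is a JSON-like document; every edge is a document that
# records which vertex it goes from and which vertex it goes to.
vertices = {
    "alice": {"name": "Alice", "age": 30},
    "bob":   {"name": "Bob",   "age": 28},
    "carol": {"name": "Carol", "age": 35},
}
edges = [
    {"_from": "alice", "_to": "bob",   "label": "is_friend_of"},
    {"_from": "alice", "_to": "carol", "label": "is_friend_of"},
]

def neighbors(vertex_key):
    """Follow all outgoing edges of one vertex (naive linear scan)."""
    return [e["_to"] for e in edges if e["_from"] == vertex_key]

print(neighbors("alice"))  # ['bob', 'carol']
```

The linear scan in `neighbors` is of course too slow for a real database; how a graph database makes this lookup cheap comes up later in the talk.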
So, what you can do for queries: you can essentially go along edges, but you don't have to specify exactly how many steps to go. For example, you can do a graph query which says: start in the upper left corner, go two to five steps in any direction, and tell me what outcomes you can get. Or you can say: take an arbitrary number of steps. Or you can do things like: give me the shortest path between two vertices, where "shortest" can refer to some value which is attached to an edge. So, a route planning application with a street network would be modeled with such a graph. Let me give you a few examples of such queries. One prototypical graph which comes up very often is a social graph. Here the vertices represent people and the edges describe friendship. In this case, it's not possible that Alice is a friend of Bob but not the other way around, so it's symmetric, and therefore I have depicted these arrows as going in both directions. Now, a typical question for a graph in this context would be: give me all the friends of Alice, and the answer would be what I have drawn in white here. The answer in this case is a set of vertices, or, if you want, the set of paths going from Alice to some friend. Another question could be: give me all the friends of the friends of Alice, that is, go two steps. In this case, for example, Eve and Frank would be the answer. And I could ask more questions: I could ask for the friends and the friends of friends, but nobody twice, or something like that. Another problem could be: take two vertices, say Alice and Eve, and find the shortest path between them, which would be, for example, this one here. As you can see, the answer is not necessarily unique, but what we are interested in is the path which has only two steps, and usually not the path which has one, two, three, four steps, going all the way across Dave and Frank.
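The shortest-path question from this example can be sketched with a plain breadth-first search. This is a minimal sketch with made-up names, not how ArangoDB implements shortest paths:

```python
from collections import deque

def shortest_path(adj, start, goal):
    """Breadth-first search: return one shortest path as a list of
    vertices, or None if the goal is unreachable. adj maps a vertex to
    its neighbors; friendship is symmetric, so each edge appears in
    both adjacency lists."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            path = []
            while v is not None:        # walk the parent links back
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in adj.get(v, []):
            if w not in parent:         # visit each vertex only once
                parent[w] = v
                queue.append(w)
    return None

friends = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice", "Eve"],
    "Carol": ["Alice", "Dave"],
    "Dave":  ["Carol", "Frank"],
    "Eve":   ["Bob"],
    "Frank": ["Dave"],
}
print(shortest_path(friends, "Alice", "Eve"))  # ['Alice', 'Bob', 'Eve']
```

Because breadth-first search explores the graph in layers, the first time it reaches the goal it has found a path of minimal length, which is exactly the two-step path preferred over the longer route via Dave and Frank.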
So, shortest paths: with a shortest path you just start with two vertices, and it is unclear a priori how long the path will be. This is in particular the place where graph databases shine, because they are prepared for such questions: they can do fixed-length paths very efficiently, but they can also do arbitrary-length paths. Another example would be a train network. Imagine the vertices in this graph are train stations, and obviously we would probably store some data with each train station. Imagine you are in the bottom left corner; you might want to ask: what stations can I reach with a ticket which allows at most six steps? And then the answer would be this here, for example. In this case the answer happens to be these stations here, but you don't know in advance how many results there are, and obviously a database is prepared to give you an arbitrary number of results. But there can be more complicated questions, and I am going towards other sensible use cases here. For example, I can ask for all users who share two hobbies with Alice. In this graph I have modeled my data such that the persons are vertices and the hobbies are also vertices, and the relation depicted by the edges says: this person performs this hobby. So here a natural question is: give me all the people who share two hobbies with Alice. In this case this would be just one person, but obviously, if there are a million users, then not all of them would share two hobbies with Alice. Let me just read this one out: give me all products that at least one of my friends has bought, together with the products I already own, ordered by how many friends have bought it and by the product's rating, but not more than 20 of them. This is rather complicated, and it just tries to make a point: you might be modeling an online shop with this kind of data structure, where you record which products have been bought by some customer by having an edge pointing from the person to
the product. Additionally, you might have this friend relation in the graph. And if you now want to find recommendations, what to put up on a recommendation page for Alice, you might want to look at what she bought, what her friends bought, and maybe recommend something to Alice which one of her friends has bought, a friend who also bought something which Alice bought. So recommendation engines in e-commerce platforms are a very typical application for graph data; in this case it would probably find this product here. Now, what happens a lot is that you analyze your data problem, like in this online shop and recommendation engine, and you say: well, this can be modeled very nicely with a graph, and surely all my queries will be graph queries. Imagine you have set up an online shop like this, and suddenly somebody says: I need to query the database for all users whose age is between 21 and 35. Suddenly this question about the age has nothing whatsoever to do with your graph structure. If you now use a pure graph database, you are somehow lost, because a pure graph database is very good at answering graph questions, but if you suddenly have a different question like this one, you are lost. That is why we have designed ArangoDB to be a multi-model database, such that even if you have modeled your data as a graph and have stored it in ArangoDB as a graph, you can still put a secondary index on the age attribute of your users, and it's absolutely no problem to run a query which completely ignores the graph structure and gives you the right set of people. Even better, once you have selected a few users by such an attribute, you can then ask graph queries for each one, and do this in the same query. That is actually the reason why we have chosen not to use SQL as the query language for ArangoDB: we would have had to extend SQL by graph features, which would have been a pain, and we would also have had to extend SQL by features dealing with
JSON documents. Therefore we decided to have our own query language, which is called AQL, the ArangoDB Query Language. It allows you to work conveniently with JSON, it allows you to mix graph queries, joins, transactions and document queries, and its query optimizer is aware not only of the distributed cluster structure of ArangoDB but also of secondary indexes, and can therefore optimize the execution of your queries. Or say you want an aggregation: you want the age distribution of your users. Again, this doesn't have anything to do with the graph structure. Sure, what you could do is store your users once in a relational database and a second time in a graph database, but then you have the fairly difficult problem of keeping the two different data stores in sync. In a multi-model data store you can just do everything in the same engine; you can group users by their name, for example. But now let's come back to these graph queries. How do you actually run them? How do you actually determine the shortest path, or find all the neighbors, or find all the paths of a certain structure? Well, it all basically comes down to the following algorithm, which I'm going to explain with an example. If you are told to investigate the vicinity of the vertex S, then the graph query engine does the following. It starts with the vertex and asks itself: what neighbors does it have? So it finds all the outgoing edges of S, and with them the neighboring vertices A, B and C. That it can do in a very quick way; I'll come to this later. Then it picks one of these neighbors, for example A. I forgot to say something: the query might do some filtering, it might want to ignore certain types of edges. So the moment we go this first step, we might throw out the edge which points to C, and we don't have to consider it anymore. But if, for example, two edges remain, we have to deal with the vertices A and B, and in this particular algorithm we would
concentrate first on A and apply the same procedure again. So we ask: what are the outgoing edges of the vertex A? This gives us another two vertices with the edges pointing there, and again we might do some edge filtering. Then we would go to E and see how we can move on, or alternatively, if the query says we only go two steps, then we have finished and have found a path from S over A to E, and so one of the items we return is this path. Now, having put out this result, we have to backtrack, because we have now finished vertex A. We go back, remember that we still have something to do with B, come back to B and find its neighbors, do the edge selection, and if this edge survives, we report the second path. So this is the incarnation of so-called depth-first search: we first go deep into the tree, or the graph. Another possibility would be breadth-first search, in which case you would first enumerate A and B, then enumerate all the direct neighbors of A and B, and so on, proceeding in layers. These are the two fundamental ways to proceed, and essentially shortest path is done in a very similar way, just doing the same from both ends at the same time and seeing where they meet. So this engine is behind most graph queries. Let's study what features the index structures or data structures in a graph database have to have to perform such a thing efficiently. Let's go through the different operations. First of all, in most graph queries we somehow have to find the starting vertex, probably given by some primary key or something like that. Well, if you have a hash index on your collection of vertices, then the complexity is O(1), by which I mean constant-time execution. If you have a different index, a sorted one, you might have something like O(log n), but usually we use a hash index. Now, at every vertex where you are, you have to find all the outgoing edges. For example, when I was at this vertex S, I needed to find the outgoing edges to A, B and C, and what you need is a data structure that
immediately gives you the list of outgoing edges. It doesn't need to be constant time, because after all you have to visit all these edges and do the edge filtering. Therefore, all you need is that, given a vertex, you find all the outgoing edges in time proportional to the number of outgoing edges. Different graph databases do this in different ways; what we do in ArangoDB is use a cleverly designed edge index to find, among all the edges in the collection, those starting at a certain vertex, and that has complexity O(k), where k is the number of edges adjacent to this vertex. Then of course we have to filter these edges, so we have to go through all of them, and for some of them we have to find the connected vertices. Each of these vertices is a lookup in the database, and again we have to filter vertices. So all together the complexity is linear in the number of neighbors. That is the crucial thing: if you are at a vertex, you have to arrange all your operations in these graph traversals such that you find all the neighbors in time proportional to the number of neighbors. Right, linear complexity sounds evil, but I can only reinforce: it's not linear in the total number of edges in the graph, that would be a catastrophe. What you need is that this is linear in the number of neighbors, and this we achieve with this technology. Now notice the following. Imagine you have a kind of social graph where you take the set of all human beings on this planet and take friendship as edges. There is an interesting estimate that if you were allowed to go at most seven hops from friend to friend, you could reach every single person, so every one of you in this room is connected to me by at most seven steps of going to a direct friend. Why is that?
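Before answering that, here is a sketch of the traversal procedure described above: an edge-index lookup that is O(k) in the number of adjacent edges, edge filtering, going deep, and backtracking. This is illustrative only; the names and structure are assumptions, not ArangoDB internals:

```python
from collections import defaultdict

class EdgeIndex:
    """Hash table keyed by the _from vertex: all outgoing edges of a
    vertex are found in time proportional to their number, O(k), not
    proportional to the total number of edges in the collection."""
    def __init__(self, edges):
        self._by_from = defaultdict(list)
        for e in edges:
            self._by_from[e["_from"]].append(e)

    def outgoing(self, vertex):
        return self._by_from[vertex]

def traverse(index, start, depth, edge_filter=lambda e: True):
    """Depth-first traversal: emit every path of exactly `depth` steps,
    filtering edges as we go and backtracking after each branch."""
    results, path = [], [start]
    def visit(vertex):
        if len(path) - 1 == depth:
            results.append(list(path))
            return
        for e in index.outgoing(vertex):
            if edge_filter(e):          # e.g. ignore certain edge types
                path.append(e["_to"])
                visit(e["_to"])
                path.pop()              # backtrack, then try the next edge
    visit(start)
    return results

idx = EdgeIndex([
    {"_from": "S", "_to": "A", "type": "good"},
    {"_from": "S", "_to": "B", "type": "good"},
    {"_from": "S", "_to": "C", "type": "bad"},
    {"_from": "A", "_to": "E", "type": "good"},
    {"_from": "B", "_to": "F", "type": "good"},
])
print(traverse(idx, "S", 2, edge_filter=lambda e: e["type"] == "good"))
# [['S', 'A', 'E'], ['S', 'B', 'F']]
```

Note that the filter removes the S-to-C edge before the engine ever visits C, which is exactly the early edge filtering described in the example: A is finished first, then the algorithm backtracks to B.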
well, it is because everybody has a certain number of friends, and if you are allowed to go one step, you have the number of friends of a single person; if you are allowed to go two steps, you have the friends of all the friends, and this grows exponentially with the number of steps you go. Therefore, in such a social network you almost certainly reach basically everybody within seven steps. So although these graph traversals are pretty fast, if you ask a question which has a huge answer, for example: take the social graph of all people and give me all the paths of length at most seven, well, then your answer is so large that your database probably can't finish producing it. So one has to be careful with these things, do clever filtering, and make sure that only questions are asked which are not too bad. The crucial thing is that this technology achieves that traversals are only as expensive as the number of results; that is all you can hope for, we can't do better. Now let's make this a bit more concrete. You have seen a few queries, and I have told you how this is actually done behind the scenes, what the complexity is, and so on. Let's now see how this would be expressed in the ArangoDB query language. This, for example, is a query where you probably can guess what it does: it runs through the complete collection of users and returns everything. That's not what we want, of course; we want something like a filter, we want to home in on one vertex, one user. So this query gives you just a single user. Now imagine this is a graph, say a social network, and we now want to do a graph query. Then we do something like this, for example this product example, where we are interested in the items Alice has bought. You do something which looks like a for loop; here, I can't really point, it says for product in outbound user. This means: start with the vertex user, use the edge collection has_bought, and go in the outbound direction one step. So this is essentially the way to say: give me all the
products bought by Alice. And I can make this even more complicated. If I want a query similar to what I said before, with this one here I can ask for paths of length 3 in any direction along this has_bought relation, and I can return not only the final vertex or the edges, but also the complete path. And since I have the path available, I can now use it for filtering. So what this does is: it starts with the vertex Alice, it goes exactly 3 steps in either direction of the arrows, and it considers only paths where the third step, that's path.vertices[2], the third vertex, is a user whose age is between the age of Alice minus 5 and plus 5, and in the end I only take products whose price is less than 25, and of all the outcomes I just take the first 10 and return those products. That's of course made up, but what it should show you is that, due to the multi-model nature, you can directly express very complicated graph queries as well as document queries, all in the same language. We are trying to make it really easy to understand and easy to use, and the database engine behind the scenes goes away, does this graph traversal for you, and returns exactly what you want. Now, so far it was quite easy, but let me mention a few of the challenges in this business. Many graphs have super nodes. For example, take the Twitter graph of followers: your vertices are the people, the customers of Twitter, and there is an edge if one customer follows another. Then there are certain customers who are followed by a huge number of people, some pop stars or whatever, and in the graph that would be a single vertex with a lot of neighbors. Imagine you do a graph traversal in this Twitter graph and you just do three steps, but you happen to step on such a celebrity. Since such celebrities have many followers, it's highly likely that you hit one of them, and things like this can happen: traversals can be very expensive once you hit one of
these super nodes. However, quite often you can avoid this, because the questions you have stay away from super nodes. So one should be aware of the problem, but in practice you just have to put a limit on things, and then the query can't explode in front of your eyes. Obviously, as a database manufacturer we have to be aware of this problem. For example, we have to make it so that if you have a super node in your graph and you delete an edge, we can't go through all the edges incident with this vertex just to delete this one edge; we have to make sure that this is a constant-time operation. So for us, super nodes pose a certain problem, but it is under control. The complexity of graph traversals grows exponentially with the length of the path, but the crucial thing is that on every level it's just linear in the number of neighbors. So these super nodes of course pose a problem if you hit one at a certain level, because then the nice complexity in the number of neighbors doesn't help you: the number of neighbors is huge. So in the presence of super nodes you can't really go many steps, because the more steps you go, the higher the probability that you hit a super node. Okay, I want to explain one more idea with this, because often you have super nodes but you don't need all the neighbors. Let me give you another example. People use graph databases to store events or logs of network connections, and usually these events come with timestamps. So a certain user might have an awful lot of events associated with him, but you are interested only in the latest 10. The graph has a super node, but you are only interested in exactly 10 of the neighbors. Now, if your graph database has to go through all the neighbors and filter them all out just to give you 10, that's bad. What is needed for that is something called vertex-centric indexes. Imagine, for example, I take all the edges and use an index which
sorts them by the identity of the vertex the edge is coming from and by the timestamp. Then I can do a range query in that index, directly on exactly those edges you are looking for: I don't have to scan the million neighbors of this vertex, I can directly give you the 10 latest. These are called vertex-centric indexes, and that is what you have to use if you have many super nodes but only need some of the neighbors. Now I come to the second challenge: distributing a graph across many servers. Obviously graphs, like all data, become big very quickly, and therefore the natural idea is to use many machines and just distribute the data. Now you still want to query your database, and you can arrange it so that for every given edge or vertex you know exactly where it is stored; there is some kind of hashing technique. The problem is that if you don't do it correctly, you have to hop around in your cluster all the time, and this is now referring back to the title of this talk: what do you do if you have a graph with a billion edges? If you just simple-mindedly distribute it one way or another, then what one of these queries will do is constantly hop around and talk to other servers just to find the next neighbor. Let me explain this with an example. Imagine this is a graph with a billion vertices and edges, and you have three, or say a hundred, servers. If you now naively just use some kind of hash on the key to distribute this data, you end up with this: the vertices are arbitrarily distributed across your computer cluster, the edges as well, and you don't even have the property that an edge which is adjacent to a vertex is on the same server as its vertex. The advantages of this approach: it's easy to realize, it's easy to store, it's easy to scale, you don't have to know anything about the data, you can just go ahead and do it. But most of the time in graph databases, the storage and the distribution is not the actual problem; the
actual problem is the query, and therefore this approach has a considerable number of disadvantages. Namely, the neighbors of a vertex will be on different machines; even worse, the edge which is incident with a vertex might be on a different machine, and if you want to run a distributed query across this, you have a lot of network overhead. So rather, what you should do is use some domain-based knowledge about your graph. Say you have customers: then most of the friends of a customer will be in the same country. So if you now do more intelligent sharding, use some domain-specific knowledge and distribute the graph across the shards in a way that most connections are local, then you have a much more sensible picture. If you do it correctly, you will have the advantage that most edges are in the same group; most edges, if not all, will be on the same server as their adjacent vertices. Sometimes you have to store an edge twice to make it really local to both its ends, but you get much better locality this way. Now the only thing which remains is to run your graph traversals in an intelligent way: you need to do as much of the work as you can locally before you go over the network and continue elsewhere. As usual, you have to do things in batches and organize this intelligently, and then you can reap the benefits of this better data distribution and run your queries much more efficiently. This comes with the just released ArangoDB Enterprise Edition. So in general ArangoDB is totally free, it's open source under the Apache 2 license, but this particular feature, the SmartGraphs feature I was referring to earlier, is part of the Enterprise Edition. It uses the domain knowledge you as a customer might have to distribute the graph more intelligently, and therefore allows you to run queries nearly as fast as would be possible on a single server. To drive this message home: this would be a graph setup in ArangoDB where the graph
sharding is done in a naive way. It gives me the opportunity to explain the structure of an ArangoDB cluster. We have two types of servers: at the bottom you see the database servers, which hold the actual data, the vertices, edges and documents, and at the top you have the so-called coordinators. They are essentially stateless, they don't store data, but they handle the incoming queries: the query parsing, the making of the execution plan, the optimizing of the execution plan, and its distribution all happen on the coordinator. The coordinator makes a plan of how to run a query and then distributes the different parts of the execution engine across the cluster. Now, if you do the distribution randomly and you do a graph query where you say: start here with this node and follow the red edges, you see that you have to hop all around the place. If you use a better distribution of vertices and edges, you can distribute your data like this here, and you see that much of the query happens locally. You start, you probably go some way on database server one, then it reports the results and the unfinished state back to the coordinator, which runs the query, and the coordinator knows: okay, here are this many loose ends, and asks the different database servers again. You get a lot fewer round trips and you have to move a lot less data across your cluster. And this is how you can actually pull it off to store a graph with a billion edges and a lot of vertices, shard it across many servers, and still get decent performance for graph queries. That is it. You can ask questions now, or you can ask later and contact me by email. This is the interesting idea I wanted to convey about graph databases: vertex-centric indexes and clever sharding, and you can scale to a billion edges. Thanks!
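To close with a sketch of the sharding idea from the last part of the talk: hashing the document key scatters related vertices across the cluster, while hashing a domain attribute like the country keeps them together. The attribute names and the three-shard setup below are made up for illustration; this is not ArangoDB's SmartGraphs implementation, only the underlying idea:

```python
import hashlib

def shard_of(value, num_shards):
    """Deterministically map a shard-key value to one of the servers."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_shards

users = [
    {"_key": "u1", "country": "DE"},
    {"_key": "u2", "country": "DE"},   # friend of u1, same country
    {"_key": "u3", "country": "FR"},
]

# Naive sharding: hash the document key, so neighbors end up anywhere.
naive = {u["_key"]: shard_of(u["_key"], 3) for u in users}

# Domain-aware sharding: hash the country, so u1 and u2 are guaranteed
# to land on the same server and a traversal between them stays local.
smart = {u["_key"]: shard_of(u["country"], 3) for u in users}

print(naive)
print(smart)
```

The point is not the hash function but the choice of shard key: with the domain attribute, most edges of the friendship graph connect vertices on the same database server, which is what keeps the traversal round trips down.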