Yeah, thank you again. So it's the second time for me in this room; in the morning I replaced a friend of mine, Max, with his talk. Who has been in this room before? I recognize a couple of faces. My slides look a bit different, so you'll see I have a different style of slides. Let's talk a bit about handling billions of edges. About myself: my name is Michael. Can anyone hear me in the back? Good. I'm working at ArangoDB, which is a database company, and I'm personally leading the graph development team there, so everything that is graph related goes over my desk. I started off with the graph visualization, which I handed over to a colleague, but all the graph features, and especially SmartGraphs, are designed by me. This is an introductory talk. Who has been in the last talk? Then I guess this is not new for you. What are graph databases? First of all, graph databases store schema-free objects, which they call vertices. That means you can have arbitrary attributes on your vertices, and not every vertex needs to have the same attributes; the logic of which attribute is actually there is pushed back to the client. Then they store relations between the vertices, which they call edges. Here one vertex is, for example, Alice, who has a name and an age, and Alice has a hobby, which has the name dancing. "Hobby" would then be the edge, connecting the two vertices, and together these form one data row, basically. Edges typically have a direction, so Alice is the starting point and the hobby is the target. All of the above could be done in a relational database: plainly, you have two tables for vertices, for example, and a joining table with two foreign keys, and that would work. The differences come in with the query language. In a graph database, edges can typically be queried in both directions, so I can say: please follow the direction, or go against the direction, or I don't care about the direction; if there is an edge, follow it. That works.
And now comes the difference to SQL: you can easily query a range of edges. Let's say I have a starting point, Alice, and I want to select everything that is two steps away. So I can go over one hobby and back to another person, Bob, without using the direct relation here. And that is really a one-liner. Even if I do five steps or 200 steps, it is still a one-liner in a graph database. If you do 200 steps in a relational database, the query just keeps growing and growing, 200 joins and so on. You don't want to write that, most likely, unless you have something that allows for a recursive pattern, like Postgres does. Even more, in most graph databases you can actually say: I don't know how many edges I need; I have a starting point, so search until you really find the result I'm looking for, no matter how many edges you have to follow on the way. Or a shortest path between two vertices. Say we start here at dancing: what is the shortest connection from dancing to fishing? The graph database will figure out: I start at dancing and go to Alice; the shortest way is then to go to Bob and from there to fishing. What are typical graph queries? Give me all the friends of Alice, which is a one-step search: I start at Alice and follow one edge, the friends relation, to all the friends of Alice, giving me this result. Or give me all the friends of friends of Alice. Friends of friends means we go two edges, but everything we found in the first layer, with one edge, should not be in the result. For example: from Alice I go one step to Bob, and a second step to Eve; Eve could be in the result. From Alice I go one step to Bob and one step to Charlie, but Alice has a direct connection to Charlie, so Charlie should not be in the result. The graph database can find those things pretty easily. Or, given any two random vertices, what is the shortest connection between them? And this does not always have a deterministic result.
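The shortest-path idea from the talk can be sketched in plain Python with a breadth-first search. The tiny dancing/Alice/Bob/fishing graph is made up to mirror the spoken example; in a real graph database this is a single query.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency dict; neighbors are listed
    in both directions, so edge direction is ignored, as in the example."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no connection at all

graph = {
    "dancing": ["Alice"],
    "Alice":   ["dancing", "Bob"],
    "Bob":     ["Alice", "fishing"],
    "fishing": ["Bob"],
}
print(shortest_path(graph, "dancing", "fishing"))
# ['dancing', 'Alice', 'Bob', 'fishing']
```

Because BFS explores layer by layer, the first path that reaches the goal is guaranteed to be a shortest one, which is why the result is not always unique when several paths have the same length.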
Because in this case, if I start at Alice and want to go to Eve, I can either go via Bob or via Charlie, both using two edges; both could be a potential result for this query. And now comes the weird part. If I'm standing at a train station and I buy a ticket, and it says you can go up to six stops and change lines as often as you like, where can I go? So I'm starting at this point and I'm allowed to travel six edges. The graph database can actually do this in a one-liner and find all the stations I could reach, even with switching lines. But the most typical query is the so-called pattern matching. Pattern matching means you have a huge data set, which is your graph, and you define a subgraph, your pattern. Say: give me all the users that share two hobbies with Alice. The pattern would be: we have Alice, with a relation to one hobby; Alice needs another relation to a second hobby, which is not the same; and then we need another vertex, a friend, with a relation to both of these hobbies. That is the pattern, and the database should then find all the friends of Alice that have two hobbies in common. Or, and now we make it a bit more complicated: give me all the products that at least one of my friends has bought, together with the products that I already own, but only 20 of them, or limited by price, or whatever. And you see where this is going: this is a super simple recommendation engine, because what my friends buy, together with the stuff that I already have, is probably interesting for me as well. It's not the best one, but it's getting close. What are non-typical graph queries? Those are queries where graph databases typically don't perform well, because they're not optimized for them. Graph databases are really optimized for queries that go along the edges, like this friends matching or pattern matching and so on.
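The friends-of-friends query described above, everything exactly two steps away minus the direct friends and the start vertex, can be sketched like this. The Alice/Bob/Charlie/Eve data mirrors the spoken example.

```python
def friends_of_friends(friends, start):
    """Two-step neighbors of `start`, excluding direct friends and start."""
    direct = set(friends.get(start, []))
    result = set()
    for friend in direct:
        for fof in friends.get(friend, []):
            if fof != start and fof not in direct:
                result.add(fof)
    return result

friends = {
    "Alice":   ["Bob", "Charlie"],
    "Bob":     ["Alice", "Charlie", "Eve"],
    "Charlie": ["Alice", "Bob", "Eve"],
    "Eve":     ["Bob", "Charlie"],
}
print(friends_of_friends(friends, "Alice"))  # {'Eve'}
```

Charlie is reachable in two steps via Bob, but he is filtered out because Alice already has a direct edge to him, exactly as in the example.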
Graph databases typically don't perform so well if you are only querying across attributes, so not across the relations, but on the attributes. For example: give me all the users with an age attribute between 21 and 33. A document store or a relational database would be much more suitable for this type of query alone. Give me the age distribution of all my users, which is basically the same, or group all my users by their name. Typically, those are queries that graph databases are not specifically optimized for; relational databases are more optimized for them, while graph databases focus more on the traversal stuff. I'm not saying it's not possible, just that the other ones put their focus more in that direction, at least in my experience. Because the major thing that a graph database does is the so-called traversal, and how that works I would like to explain by example. First of all, we pick a starting vertex, which we find in a way that is as fast as possible. Then we collect all the edges for this starting vertex, and all the neighbors that are connected by those edges. Then we can apply filters on the edges and the vertices, and we end up with the first subset here. Now our traversal has started at S, and it will go on to either A or B, because we said: please go two edges deep. It picks one, let's say A. Then it does the same step again: it checks all the edges connected to A, applies the filtering again, and figures out that D is connected, but we filter it out. E is also connected, and then we have the first result: S to A to E. It has two edges, the filter is matched, and this would be one part of the result. The traversal could continue here, but we said it should stop at two edges, so it doesn't. Instead it goes back to A and checks whether there is another edge it has to process.
In this case there isn't, because we removed D from the result set and E is visited. So we go back to S and check whether there is another edge, and here we find B. We go to the unfinished vertex B, which we haven't visited yet, and apply the same pattern again: we iterate down to B, apply filters on all the edges, and go down to everything we find. In this case, S to B to F would be another result of this traversal query. Let's talk a bit about the complexity of such a traversal algorithm. First we have to find the starting vertex. How fast we can do that depends on the index. In the best case it is a hash index, then it's constant time to find it. If you only have a sorted index, it's logarithmic time. If you don't have an index at all, your database will be rather slow, because it takes time linear in the number of vertices you have. So put an index on it and you can find it fast. But now comes the important part. For every depth that we want to search, we apply the following steps. First we find all the connected edges, and this is done either with an edge index or with so-called index-free adjacency. Both of them basically need to be constant time, because this is the operation the database will do most of the time, so it should be as fast as possible. Then we have a set of edges. We walk through this set, and for each edge we check whether the filter conditions on the edge match and whether the filter conditions on the connected vertex match, and of course we have to fetch the connected vertex. All of that can be done in one go, so only three times n is required for these steps, where n is the number of edges on the vertex: linear complexity. But we do this for every depth. We start with one vertex, so one times three times n; in the next step, though, we start with up to n starting vertices, and so on. So that means roughly three times n to the power of the search depth. That's the complexity of a traversal algorithm.
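The traversal walk-through above can be sketched as a small depth-limited search: start at S, follow edges, drop filtered vertices, and collect every path of exactly two edges. The S/A/B/D/E/F graph mirrors the slides as narrated.

```python
def traverse(edges, vertex_ok, start, depth):
    """Collect all paths of exactly `depth` edges from `start`,
    skipping vertices rejected by the `vertex_ok` filter."""
    results = []

    def step(path):
        if len(path) - 1 == depth:
            results.append(path)
            return
        for target in edges.get(path[-1], []):
            if vertex_ok(target) and target not in path:
                step(path + [target])

    step([start])
    return results

edges = {"S": ["A", "B"], "A": ["D", "E"], "B": ["F"]}
# filter out vertex D, as in the example
print(traverse(edges, lambda v: v != "D", "S", 2))
# [['S', 'A', 'E'], ['S', 'B', 'F']]
```

The recursion also makes the complexity argument visible: each level multiplies the work by the number of edges per vertex, which is where the n-to-the-power-of-depth estimate comes from.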
That sounds evil, but it's not linear in the total number of edges, and that's the great benefit: it's only linear in the number of edges attached to each vertex. So if you have trillions of edges stored, but every vertex only has five of them, all your traversals will still be fast, unless you search to a depth of five thousand or so, because you only scale with the edges connected to one vertex. There is a certain drawback that I will come to in a minute. Typically, the number of edges connected to one vertex is much smaller than the total number of edges that you store. So the benefit of traversals is that they only scale with the result size: if you increase everything you store in the database, but the result size for one search stays constant, you will have the same performance. And there is a rule that I learned at university which holds true for all social networks, for everything that has grown naturally: the so-called seven degrees of separation. Seven degrees of separation means: take any social network, Twitter, Facebook, whatever, and take any two random persons. You have an extremely high chance that the shortest path between those two persons is of length seven or less. That means if you run an unfiltered traversal of depth seven, you are most likely returning 99% of your graph database, as long as you have a naturally grown graph, one where users create the edges. For an artificially grown graph that may not be the case, but otherwise you have a high probability of ending up in this situation. So expect this query to be slow on a large data set. Let me say a couple of words about ArangoDB, because that is the technology I'm going to use to show you how to execute those queries. ArangoDB is a so-called multi-model database: it can store key-value pairs, documents, and graphs with only one core, in the same database.
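A back-of-the-envelope calculation makes the two claims above concrete: with an average of n edges per vertex, an unfiltered traversal of depth d touches on the order of n to the power of d paths, independent of the total number of edges stored. The numbers below are illustrative, not measurements.

```python
def paths_touched(edges_per_vertex, depth):
    """Rough upper bound on paths explored by an unfiltered traversal."""
    return edges_per_vertex ** depth

# five edges per vertex, depth 3: tiny, no matter how big the database is
print(paths_touched(5, 3))   # 125

# fifty edges per vertex at depth 7: the "seven degrees" trap
print(paths_touched(50, 7))  # 781250000000
```

This is why an unfiltered depth-7 traversal on a naturally grown social graph effectively touches almost the whole database, while a depth-2 or depth-3 query on the same data stays cheap.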
So I can use all three data models with one query language in one process, basically, and combine them in any arbitrary way. We have a query language called AQL, which has document queries, graph queries, and joins all implemented in the same language, and you can arbitrarily combine all of them. And it has ACID support for multi-collection transactions on a single server, though not yet on the distributed version of ArangoDB. How does AQL look? The syntax is not SQL, because we found it rather confusing to put 90% self-implemented stuff into SQL and then let users try SQL statements that kind of don't work. So we decided we needed a clearly distinguishable language, which makes it obvious that you're not writing SQL here. And SQL doesn't really fit document stores, and not really graph databases either. That's why we settled on a new language, modeled more after XQuery, the language for querying XML. Basically, you're always writing for-loops. So "FOR user IN users" means I'm iterating over a collection, comparable to a relational table; every row I find is stored in the variable user, gets processed later in the query, and could, for example, be returned. Of course I can apply filters: I can say, please only return the users with the name Alice. The query optimizer can then figure out that there is actually quite a good index on this name, so it will automatically use the index in your query. You will never say "I want this or that index" inside your query; the optimizer handles that for you. So with this query we find everyone who is called Alice. And as promised, we can combine it with graph traversals. At this point we have a user in our hand, and we want to continue starting from this user. So: for a product, which is again the variable returned by the following statement, in one outbound step starting at the user, following the relation has_bought.
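The pattern the speaker narrates, filter a collection and then start a one-step traversal from each match, can be mimicked in plain Python over in-memory dicts; this is a sketch of the query's logic, not ArangoDB code, and the sample data is made up.

```python
# two "documents" in a users collection, plus a has_bought edge relation
users = [{"name": "Alice", "_key": "u1"}, {"name": "Bob", "_key": "u2"}]
has_bought = {"u1": ["book", "game"], "u2": ["lamp"]}

results = []
for user in users:                                # FOR user IN users
    if user["name"] == "Alice":                   # FILTER user.name == "Alice"
        for product in has_bought[user["_key"]]:  # one outbound step over has_bought
            results.append(product)               # RETURN product
print(results)  # ['book', 'game']
```

In AQL the optimizer would replace the outer loop plus filter with an index lookup automatically; the nesting of for-loops is the mental model the language is built on.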
And return this product. We end up with the following pattern: we start at Alice, which is pretty well defined, find one edge, and connect to one product. We can also make it a bit more complicated. The graph traversal statement can actually return three values. The first is the last vertex you are standing on; the second is the edge pointing to that last vertex; and the third, the path, is the entire path. So in this example, if I'm standing at the play session, the play session would be my result for the recommendation, this relation would be the action, and the entire thing would be the path. Here I'm doing three steps, or I can also put in one-to-three steps or whatever. I'm using ANY direction, so I don't care about the direction of the edge; I can go inbound or outbound, whichever I like. And of course I can apply filters. My filter would be: the second vertex on the path, this guy here, should have an age equal to the user's age plus or minus five, so someone who is about my age. And for the recommendation, the price should be less than 25, and I only want the top 10. A simple recommendation engine again, not the best choice, but it kind of works. And the optimizer will figure out that if the second vertex on the path does not have this age, it will not continue searching there. So optimization is handled. But now let's talk about challenges. What is the first challenge if you want to scale a graph database? I think it is the most common one: supernodes. In many graphs you have so-called celebrities, super important people or items, whatever, which have many, many inbound or outbound edges. Say you go to Twitter and you pick, let's say, Donald Trump: he has a super large number of followers. So whenever your query walks over him, you have to check all the followers of Donald Trump, which are a lot. No political statement here, please.
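A simplified variant of that recommendation engine can be sketched as follows: products that at least one friend bought, minus what I already own, cheapest first, capped at a limit. All names and prices here are invented for illustration; the real query would also filter on the friend's age, as described above.

```python
def recommend(friends, bought, prices, me, limit):
    """Products bought by my friends that I don't own yet, cheapest first."""
    mine = set(bought.get(me, []))
    candidates = set()
    for friend in friends.get(me, []):
        candidates.update(bought.get(friend, []))
    candidates -= mine  # don't recommend what I already have
    return sorted(candidates, key=lambda p: prices[p])[:limit]

friends = {"alice": ["bob", "charlie"]}
bought = {"alice": ["book"], "bob": ["book", "game"], "charlie": ["lamp"]}
prices = {"book": 30, "game": 20, "lamp": 10}
print(recommend(friends, bought, prices, "alice", 2))  # ['lamp', 'game']
```

The point the optimizer remark makes is that in a database this pruning happens during the traversal, not after it: a path whose second vertex fails the age filter is abandoned immediately instead of being expanded further.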
As we said, the traversal scales only with the number of connected edges, so this guy will be super expensive. If your query went over my account instead, which has fewer followers, unfortunately, the query would be faster. However, often you only need a subset of those edges: in most cases you only need the top 10, the newest 10, or the edges that have a certain attribute on them. So the first boost we can apply is the so-called vertex-centric index. A vertex-centric index allows you to index an edge based on the connected vertex plus arbitrary attributes on the edge itself. It can be sorted, or only for equality like a hash index, or whatever index type; the important thing is that it is keyed on the vertex plus something stored on the edge. The point is that finding the result in the index is constant time, at least for a hash index; if it is sorted, it may be logarithmic time, but that is still faster than fetching all the edges connected to that vertex and then iterating through the entire set. With this index in place, you need either less or no post-filtering, and thereby you decrease the n in this equation drastically, and you get reasonably fast queries. But now let's come to challenge two: big data. A lot of companies out there just store everything they can, so whenever you click with your mouse on a website, it stores an event in the database, on some pages at least. And of course this also happens with graph data. From our customers we see that data sets easily grow beyond a single machine, even beyond these super large machines that you can buy nowadays. So for graph data we have the challenge of allowing graph processing on more than one machine, with all the graph algorithms that we want. How can we do the scaling? What is the general scaling idea?
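The vertex-centric index idea can be sketched with a dict keyed by (vertex, edge attribute) instead of by vertex alone: for a supernode with millions of edges, looking up only the edges with a given label becomes one hash-bucket lookup instead of a scan over everything. This is a toy model of the concept, not ArangoDB's implementation.

```python
from collections import defaultdict

class VertexCentricIndex:
    """Edges keyed by (vertex, edge label) for O(1) filtered lookups."""

    def __init__(self):
        self._index = defaultdict(list)

    def add_edge(self, from_vertex, label, to_vertex):
        self._index[(from_vertex, label)].append(to_vertex)

    def neighbors(self, vertex, label):
        # one hash lookup, no scan over the vertex's full edge set
        return self._index[(vertex, label)]

idx = VertexCentricIndex()
idx.add_edge("celebrity", "follows", "fan1")
idx.add_edge("celebrity", "friend_of", "spouse")
idx.add_edge("celebrity", "follows", "fan2")
print(idx.neighbors("celebrity", "friend_of"))  # ['spouse']
```

With a plain edge index, the same question requires walking all of the celebrity's edges and filtering afterwards; the compound key moves that filter into the lookup itself.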
It is so-called sharding: you distribute the graph, or the data set in general, onto several machines, in so-called shards. Sharding means you take your huge data set, chop it into parts, and each part belongs to one machine. No machine holds the entire data set, because the assumption is that the data set is too large for that. However, every machine could in theory hold more than one shard. So sharding itself is kind of solved and kind of easy. The complexity comes when you want to query it, especially in the graph area. So how do we query it now? First of all, we cannot get a global view of the graph, because the assumption is that it doesn't fit on a single machine, so how should we compute that view? And at some point we have to chop the data set, so what about the edges between servers? An additional challenge in sharded environments is that the network is, most of the time, the bottleneck. A query that constantly jumps between servers is most likely slow, so you want to reduce network traffic as much as possible and keep as much as possible on a local machine. The target is to reduce network hops. Again, we could use vertex-centric indexes for supernodes; however, they only work on a single machine. At least, there is no distributed implementation of a vertex-centric index that I know of right now, so if anyone knows one, please correct me. So vertex-centric indexes only help on a single machine. Nevertheless, for one shard you could add a vertex-centric index, and it would improve the local part of the query. So now, first, let's distribute the graph. What are the dangers of sharding? First of all, only part of the graph is on each machine. Neighboring vertices may be on different machines, and even the edges could be on yet another machine, at least in the ArangoDB case. For Neo4j that's probably not possible; I think you don't have this sharding mechanism there at all, right?
So if you just distribute the data randomly, it could be that one vertex is here, the edge is there, and the other vertex is on a third machine, potentially. Next, queries need to be executed in a distributed way, so you need some component that coordinates: I have to fetch some data here, some data there, and some data over there. And then the result needs to be merged locally, so if you have a distributed graph engine, make sure your result is not the entire graph; that will kill your machine. So, distribution techniques. The easiest distribution is what I like to call random distribution. The advantage of random distribution is that every server takes an equal portion of the data, because the idea is that I take every single object, be it a vertex or an edge, throw a dice, and depending on the result, I put it on one of the machines. I don't care about its surroundings. Random distribution is super easy to realize, because Math.random is pretty much implementable by everyone, and you don't need any knowledge about the data; it works on any data set. The disadvantage becomes clear as soon as we distribute this graph across these machines: the neighbors end up on different machines. You have totally lost the layout of the graph; it's not connected anymore, so reconstructing the original graph from this is quite hard. Probably the edges are on different machines than the vertices, and a lot of network overhead is required for querying. It works, but expect it to be slow. So let's switch to the demonstration for a moment. This is the ArangoDB web interface. I have imported a social network called Pokec, which is available from Stanford University for download, and I have imported it two times: one time with the random distribution, one time with the smart distribution that I'm going to cover in a minute. It has around 1.6 million profiles, if I'm not mistaken; it's not readable at this resolution.
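The cost of random distribution can be sketched numerically: assign each vertex to a random server and count how many edges end up crossing machines. With s servers, roughly (s - 1) / s of all edges become network hops. The ring graph below is an arbitrary example.

```python
import random

def cross_server_edges(vertices, edges, servers, seed=0):
    """Place each vertex on a random server; count edges whose endpoints
    land on different machines (each one costs a network hop)."""
    rng = random.Random(seed)  # seeded for reproducibility
    placement = {v: rng.randrange(servers) for v in vertices}
    return sum(1 for a, b in edges if placement[a] != placement[b])

vertices = list(range(1000))
# a ring graph: every vertex connected to its successor
edges = [(v, (v + 1) % 1000) for v in vertices]

crossing = cross_server_edges(vertices, edges, servers=3)
print(crossing / len(edges))  # roughly 2/3 of the edges cross machines
```

So on a three-server cluster, about two thirds of every traversal's steps leave the local machine, no matter how well connected the original graph was. That is the overhead the smart distribution below removes.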
And it has 30 million edges connecting those vertices. The random distribution places the documents and the vertices somewhere in my cluster; I'm running a three-instance cluster, three physical servers, each running an ArangoDB. Now I have a query: a simple two-step traversal. I'm starting at one vertex, and I'm using bind parameters here because I'm not changing the query later on; I give it the starting collection and the graph it should use, the randomly distributed one. I apply some filter on the first level, and I just return the key, because it's easier and I don't want the network traffic to slow down the result. So if I execute this query... and wait... and wait a bit longer... in 1.5 seconds I can find the 5,700 vertices matching this condition. More important is this time: 1.5 seconds. This is not what we aim for. So now let's go on to scaling. First of all, index-free adjacency: that is the mechanism used by most other graph databases. It means that every vertex maintains two lists of its edges, the in-list and the out-list, and they are physically attached to the vertex. This has the advantage that we do not need an index to find those edges: whenever we fetch a vertex, we have its list of edges, and we only need to fetch the connected vertex again. But if we have this condition that the edges have to be physically attached to the vertex, how could we shard this? If we want to shard it in a good way, the first part is clear and the second part is clear, but where do we put the edge? If I put it on the left machine only, that would violate the condition on the right one, and vice versa. I could duplicate it, but then I need logic that keeps both versions in sync when I write to the different machines. ArangoDB goes a different path: we have a hash-based edge index, which is a constant-time lookup because it's hash based. But this makes the edges independent from the vertices.
So the vertices just know: okay, I have to ask over there to find my list of edges. And this actually allows us to pick either of the two servers and store the edge there; it does not force the edge onto a specific machine. But now comes the tricky part: domain-based distribution. What we have found with our customers is that many graphs have a natural distribution. Natural distribution means that most parts of the graph are largely connected because they share a common property. For example, in a social network, a friend network, most of your friends will typically be from the same region or country. You will have some friends abroad, but most of them will be local. Or if you are storing blogs in a graph, for whatever reason, blogs with the same tag are probably related to one another. Or categories for products, and a couple more; these are just some examples. For this distribution it means most edges stay inside the same group, connecting people from Brussels, say, and you have rare edges between different groups, but they are still allowed. If we now apply the domain-based distribution to this graph, we can shard it in a much better way, using the domain as the sharding attribute: we put one large part on one machine, another large part on the second, and another large part on the third. And we have a choice where we put the edges between groups, either here or there. If the database now knows that this condition holds, that most of the data is on the same physical machine, it can do many more operations locally, which is much faster than having the network in between. However, this is only available in the ArangoDB Enterprise Edition. And we use this domain knowledge for shortcuts. Where's my demo? Over here. Same data set, and I'm just using the different distribution of the data across the servers. The only thing I change is that I'm not using the random distribution but the smart one. And off we go.
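Domain-based distribution can be sketched by sharding on the domain attribute instead of randomly: everyone from the same region lands on the same server, so edges inside a region stay local and only the rare cross-region friendships become network hops. The people and regions below are made up.

```python
def cross_shard_edges(placement, edges):
    """Edges whose endpoints live on different servers."""
    return [e for e in edges if placement[e[0]] != placement[e[1]]]

# shard by the domain attribute: region -> server
region = {"ann": "brussels", "bob": "brussels", "carl": "berlin", "dora": "berlin"}
servers = {"brussels": 0, "berlin": 1}
placement = {person: servers[r] for person, r in region.items()}

# two local friendships, one rare cross-region friendship
edges = [("ann", "bob"), ("carl", "dora"), ("ann", "carl")]
print(cross_shard_edges(placement, edges))  # [('ann', 'carl')]
```

Compare this with random placement, where on average most edges cross machines: here only the one edge that genuinely spans two domains needs the network, which is exactly the property the smart distribution exploits.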
One, two, three: 300 milliseconds. Exactly the same result; if you want, you can compare them. But here we could make use of the knowledge that we have most of the data locally, and only occasionally needed a network hop to collect the other parts of the data. I have seen much more impressive numbers on some of our customer data, where in the best case they could fit the entire graph, or one customer's part of the graph, on a single machine, and only needed to shard because they have many customers. Then they really get down to single-server performance, although they have an arbitrarily scalable cluster at hand. Yeah, skip this animation. Good. How does it work? This is ArangoDB's architecture in the cluster. We have database servers, responsible for the data, and we have coordinators, which are the user-facing servers. The user sends a query to a coordinator; the coordinator figures out which database servers play a role, distributes the work down to the database servers, and they physically fetch the data and report back. So if I now create a long path, which is actually quite nicely sorted already, just for the sake of demonstration, we see that we need one, two, three, four, five, six network hops to collect the entire result set. If we could shard it a bit better, we could get away with a single one. Much faster. And this is the whole idea of how we get a lot of speed in a scaled cluster for point queries. Question, please. So the question is how we move these shards, these vertices, around: whether we do it adaptively or when the graph is loaded. For the released version, you have to give the sharding attribute upfront, and we place the vertices by that sharding attribute. Our plan is to do this more adaptively in the future, but that's not implemented yet. Yes, please. The question is what happens if the graph changes over time. The assumption is that in most cases the distribution roughly stays the same.
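The hop count from the coordinator example can be sketched directly: walking a path, every step whose next vertex lives on a different database server costs one network round trip. The two placements below are invented to reproduce the six-hops-versus-one comparison.

```python
def network_hops(path, server_of):
    """Count the steps along `path` that cross server boundaries."""
    return sum(1 for a, b in zip(path, path[1:]) if server_of[a] != server_of[b])

path = ["v1", "v2", "v3", "v4", "v5", "v6", "v7"]

# random-style placement: vertices scattered over three servers
scattered = {"v1": 0, "v2": 1, "v3": 0, "v4": 2, "v5": 1, "v6": 2, "v7": 0}
# domain-style placement: almost the whole path on one server
clustered = {"v1": 0, "v2": 0, "v3": 0, "v4": 0, "v5": 0, "v6": 0, "v7": 1}

print(network_hops(path, scattered))  # 6
print(network_hops(path, clustered))  # 1
```

Since each hop is a full round trip between the coordinator's database servers, shaving six hops down to one is where the 1.5-second versus 300-millisecond difference in the demo comes from.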
Of course you will have people that move to other areas. For those, the query will actually get slower, because the condition doesn't hold anymore; most friends will then be in a different location. But over time, the friendships of that person will probably shift to the new local area as well. So it may be that when that happens, some queries are slower than before, because the distribution changes for certain vertices. But in general it works out quite well, because the core of the distribution doesn't change randomly, in most cases. The next question is whether the sharding key has to be defined on every vertex, and that's true. Good. Time for questions. Any questions for Michael? Yes, please. "Today we learned a lot about languages like openCypher, G-CORE, or PGQL. Did you at some point consider also implementing that kind of language?" So the question is whether we have considered implementing different query languages. We have thought of it, for sure. But the thing is, we are a multi-model database, so our query language has the requirement that it covers all the data models that we have, and graph query languages are typically designed only for the graph model. So for us it wouldn't be feasible to support the many graph query languages that are out there right now. "Okay, I have a question. Since the talk is titled handling billions of edges: what was the largest deployment of ArangoDB that used smart sharding, in terms of how many billion edges?" So the question is what the largest setup is that we have seen so far. It was above 1 billion vertices, though I don't have the exact numbers in the back of my head, and 10 or 20 times as many edges. So it works out quite well. Cool. Yes, thank you so much. I have t-shirts, for whoever likes.