As has already been said, this talk is about sharding graphs and how to query them. Because the talk is only 20 minutes long, we can't evaluate all the possible approaches, so we are going to focus on one that we found very interesting, that gave very good results, and that we implemented. That approach is splitting your graph database into disjoint subgraphs that have no physical connections between them, and then introducing query language extensions that make it easy to work with those subgraphs as if they were one graph.

To make it more interesting, we are going to show this concept on an example, and the example is going to be the LDBC Social Network Benchmark. The LDBC Social Network Benchmark is a benchmark that emulates a social network, and it defines the data set, the data model and the queries that are part of the benchmark. What is interesting about it is that the graph is very strongly connected. We have a colleague who is sort of an in-house expert on the LDBC benchmark, and when we told him that we wanted to do a sharded version of the LDBC benchmark, he said, "You are crazy, that can't perform well," and we sort of proved him wrong.

So before we go into the sharding, let's give an introduction to the LDBC model, so that we are in the picture. It's a social network, so you have people, and they have relationships between them.
That's the core of every social network. The main workload is that people post messages in forums. So you have forums, which have members and an owner, and people put posts in those forums. They can also comment on posts, and comment on comments. Of course each post and comment has an author, and they can also be liked by a person. Each forum, post or comment has tags representing its content. Tags have a class, and the classes are in a hierarchy. People can express interest in some topics, saying "I'm interested in topics marked by those tags." Also, people live in cities, cities are located in countries, and each message, meaning a post or a comment, also has linked to it the country it was created from.

Sharding, as we learned, is not just about the data. It's also about the workload that you want to perform. So when you want to shard something efficiently, you should look both at the data and at what you want to do with the data.
So, just as an example of what a typical LDBC query looks like: this example is query nine from the interactive complex category. Given a start person, you are looking for messages created by friends of that person, or by friends of friends, and the messages have a filtering criterion: they have to be created before a given date. So this query traverses quite a big portion of the graph.

Now the interesting part: just for inspiration, how we decided to shard this model. We evaluated many approaches, and this is the model we came up with. What is interesting about it is that it is asymmetric, which means that not all the shards hold the same data model. We always have one person shard, which contains all the information about the people and the relationships between them. The rest of the shards are what we call forum shards: we distribute all the forums over the remaining shards. This is a big advantage because, from a graph perspective, the forums are sort of a forest, and they don't have any interesting relationships between them. So this can be done quite easily using some kind of simple sharding function, for instance a modulo on the forum ID.

What's also interesting about this model is that a person is also represented on those forum shards, but not as a full representation of the person; it is just a node with the person ID.
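A minimal Python sketch of the placement rule just described, not the actual implementation: it assumes shard 0 is the person shard, that forum shards are numbered from 1, and that the modulo is taken over the number of forum shards. The names `forum_shard`, `PersonStub` and `place_post` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

PERSON_SHARD = 0  # assumption: shard 0 is the single person shard


def forum_shard(forum_id: int, num_forum_shards: int) -> int:
    """Map a forum to one of the forum shards, numbered 1..num_forum_shards."""
    if num_forum_shards < 1:
        raise ValueError("need at least one forum shard")
    # Forums have no relationships between each other, so a plain modulo is enough.
    return 1 + forum_id % num_forum_shards


@dataclass(frozen=True)
class PersonStub:
    """Person node as stored on a forum shard: only the ID, no other properties."""
    person_id: int


def place_post(forum_id: int, author_id: int, num_forum_shards: int) -> Tuple[int, PersonStub]:
    """A post lives on the shard of its forum, next to an ID-only stub of its author."""
    return forum_shard(forum_id, num_forum_shards), PersonStub(author_id)


if __name__ == "__main__":
    for fid in (10, 11, 12, 13):
        print(fid, "->", forum_shard(fid, 3))
```

Because a whole forum, with its posts and comments, lands on one shard, everything below a forum can be traversed locally; only the person network has to be fetched from the person shard.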
You can see that node as a proxy node, or a reference to the person shard, which actually contains the full information about the given person. Also, for efficiency, some of the data is replicated across all shards, namely the location structure and the tag structure. That isn't a big problem, because this is a very small part of the data set. From the data volume point of view, the biggest part is the messages and the forums and that structure. Now that you have the data model, the interesting part is how we can easily query it and work with it.

Right, so assuming we have our data now split over some set of disjoint subgraphs, how could we use Cypher to query across those? We introduced two new constructs to Cypher, which are USE and the CALL subquery. USE quite simply dictates what subgraph a certain query part should go to. So the MATCH here will go to graph A and match only from that graph. USE is allowed for each query part, so in a UNION, for instance, you can select two different graphs.

The other construct is the CALL subquery clause, which is very similar to calling a procedure in Cypher, except that the body of the call is an inline Cypher query. Like any Cypher clause, the subquery gets executed once per incoming row.
So in this case it's going to get executed three times, once for each value of x, as we unwind this list of three values. The return columns of the subquery are then exposed and available as new variables in the outer scope. So here we return the number of movies, and then we have access to the number-of-movies variable outside of the subquery.

These can then be combined in interesting ways. We can of course have the USE clause dictate what subgraph the subquery should execute against. So in this case we go to graph A for the duration of this subquery, and then we return to the outer context once it has finished executing.

We also have support for correlated subqueries, meaning that the subquery can access variables coming from the outer scope. We have opted for an explicit approach here, where you need to specify each variable that you want to access inside. That looks like this: WITH x imports x from the outer scope.

The most powerful use of CALL and USE together is the dynamic USE lookup, where you can go to a different subgraph depending on the data that's coming in. This is again a sort of correlated subquery, because you choose what graph to execute against based on the value of this graph ID variable. So each execution of the subquery goes to a different subgraph.

Right, so let's look at this in the context of one of the LDBC queries. How would the Cypher actually look to implement one of these?
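Before the LDBC example, the execution semantics of CALL and USE described above can be modelled roughly in plain Python. This is only a sketch of the behaviour, not Cypher and not the engine: the `GRAPHS` dictionary stands in for the disjoint subgraphs, and the names `call_subquery`, `count_movies` and `run` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]

# Hypothetical stand-in for disjoint subgraphs: graph ID -> movie titles stored there.
GRAPHS: Dict[int, List[str]] = {
    0: ["Alien"],
    1: ["Heat", "Ronin"],
    2: [],
}


def call_subquery(
    rows: Iterable[Row],
    imports: List[str],
    subquery: Callable[[Row], Iterable[Row]],
) -> Iterator[Row]:
    """Run `subquery` once per incoming row and add its columns to the outer row."""
    for row in rows:
        # Explicit import: the subquery sees only the variables it asked for.
        visible = {name: row[name] for name in imports}
        for result in subquery(visible):
            yield {**row, **result}


def count_movies(scope: Row) -> List[Row]:
    """Correlated subquery: the imported graph_id decides which subgraph is read."""
    graph = GRAPHS[scope["graph_id"]]  # models the dynamic USE lookup
    return [{"movie_count": len(graph)}]


def run() -> List[Row]:
    # Models unwinding a list of three values into three incoming rows.
    incoming = [{"graph_id": g} for g in (0, 1, 2)]
    return list(call_subquery(incoming, ["graph_id"], count_movies))


if __name__ == "__main__":
    for row in run():
        print(row)
```

The subquery runs three times here, each time against a different subgraph, and `movie_count` becomes available in the outer rows afterwards.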
So this is interactive complex number six from the LDBC queries, and it reads like this; I've expressed it in some sort of pseudocode here. We're given a person and we're given a tag. We want to find friends and friends of friends of this person, then we want to find the posts made by these friends that have this tag, and the final result is all the other tags of this set of posts that we have found.

Okay, so how would that look? We would start with a subquery that goes to the person shard. Remember, we have the person network in one shard, and then we have the different forum shards containing the posts and the bulk of the data. So we go there, we match persons and friends of friends, and then we return just a collection of the friend IDs. So now we've taken care of the friends-of-friends part of this query.

Then we continue by going to each of the remaining shards, the forum shards. We import these friend IDs into each of them, and we basically reboot the query at this point by matching for friends where the ID is in this friend IDs collection. So it's sort of a manual passing of data through to the different parts of the query.

At the end we do a final global aggregation. What we actually need to return is the tag name for each of these tags that we found, and then we aggregate the post counts for all of those. So we did a local aggregation first on each of the shards, and now we do another global aggregation of those already summed numbers of posts. Then finally we limit everything to 20, because that's what the query expresses.

Using this sharding scheme, most of the queries in the LDBC interactive complex set can be expressed in very similar ways. This is another example, where the first part is quite similar.
We again go to the person shard and find friends of friends; that's a common pattern in these queries. We then go to the forum shards, but this time performing some other matching and some other kind of aggregation.

All right, that's basically all that we have. For further reading, we have a blog post out about our implementation of this inside Neo4j. The language constructs are released under the openCypher project, and those are open source. The implementation, the engine behind this sharding scheme, is not yet open sourced. We're hoping that parts of it are going to get open sourced in the future. And Michael has asked me to say that Neo4j is currently hiring engineers in all positions. Thank you.

Why do you want the user to address the shards, rather than having them totally hidden from the user? Just fire your old query, and the query processor knows about the sharding and then compiles the simple user query into the extended query like you showed.

Right, so the question is: couldn't this directing of what shard to go to in the queries be solved automatically by the system? Yes, it could, and we're exploring things like that. But Neo4j is currently schema-less, and we're looking at solutions that introduce schemas which would dictate where data lives in shards, and going that route to eliminate these annotations of where to find your data.

I'm glad you asked. Right, so what else can you do with Fabric, with this sort of setup? Other use cases are, of course, things like data federation, where you might have a bunch of already separate databases and you want to run some sort of analytical queries across all of your data. Let's say you have a microservice architecture where you have a bunch of different Neo4j stores that are not connected; you might run queries like this to aggregate data across all of them.
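The local-then-global aggregation described for the complex-six query, which is also the shape an aggregating query across separate stores would take, can be sketched in Python as follows. This is an illustration under assumptions, not the Neo4j implementation: each shard is modelled as a list of (author ID, tags) pairs, the result order (count descending, then tag name) is assumed, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Post = Tuple[int, Set[str]]  # (author_id, tags on the post)


def local_aggregate(posts: Iterable[Post], friend_ids: Set[int], tag: str) -> Counter:
    """Per-shard step: count the other tags on posts by the friends that carry `tag`."""
    counts: Counter = Counter()
    for author_id, tags in posts:
        if author_id in friend_ids and tag in tags:
            counts.update(tags - {tag})
    return counts


def global_aggregate(partials: Iterable[Counter], limit: int = 20) -> List[Tuple[str, int]]:
    """Final step: sum the per-shard counts, then sort and limit the result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    # Assumed ordering: highest post count first, ties broken by tag name.
    return sorted(total.items(), key=lambda item: (-item[1], item[0]))[:limit]


def run_query(
    shards: List[List[Post]], friend_ids: Set[int], tag: str, limit: int = 20
) -> List[Tuple[str, int]]:
    """Send the friend IDs to every shard, aggregate locally, then merge globally."""
    partials = [local_aggregate(posts, friend_ids, tag) for posts in shards]
    return global_aggregate(partials, limit)


if __name__ == "__main__":
    shards = [
        [(1, {"a", "b"}), (2, {"a", "c"}), (3, {"a", "b"})],
        [(1, {"a", "b", "c"}), (2, {"b"})],
    ]
    print(run_query(shards, friend_ids={1, 2}, tag="a"))
```

Only the small per-shard counters cross shard boundaries here, which is the point of aggregating locally before the global step.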