A little bit about my work. I work for a German government oversight body as a data scientist. This position gives me the room to do a lot of coding, without the bureaucracy of the government IT department. I have a PhD in physics, and you see in the data science community that there are actually a lot of physicists around, so that is no problem. My position also gives me the opportunity, the freedom, to try out the interesting technology that I find, such as graph analytics and data fusion, and at the same time I can be of use to my organisation. In my work I am not bound to any technology, whether proprietary or free and open source. The only attachments, emotional attachments, that I could have are to Apache TinkerPop and JanusGraph, because of the work I did for them.
So, differentiated access control to graph data. We will first explore the issue a little, so that we all see it the same way. Then I'll explore three directions for tackling this issue. One of the three I have worked out, and I'll demonstrate it in a notebook demo.
First, why would we need access control in graph data, and what are the tricky points? Normally, when you ingest data into a graph, you have data from multiple data sources. In this diagram, for example, you could have data from some application of a business department, let's say buying history. You could have some data from a finance department, and you could also have a marketing research team that has downloaded a lot of Facebook data and also wants to include it in the graph. Of course, when you have so many data sources you have to deal with privacy laws, and often there are also corporate policies on which data you may or may not combine. So that introduces the access control.
One example here: say that the business department wants to do recommendations to customers, but corporate policy says they cannot use the social media data, nor the exact location data of customers, for those recommendations, because the business thinks customers would feel uneasy if they saw that their address data or Facebook data was actually used for the recommendations. They would soon find out. In this diagram we see, for example, that the location data only comes from the finance department, and we can assume that the customers and the products are part of all three data sets.
We can then zoom in on just one issue of graph traversals and access control. Let's say that we have two data stores with persons in them, and from both data stores we get edges between the persons, for example if they know each other. When some user only has access to the edges of store one, then a graph query by that user for the relations of person one shouldn't return person four, because the user is not allowed to see this relation.
So now we have seen what access control in graph data is about. When you think about how you could address this issue, there are a few ways you could go. The first is to perfectly isolate graphs, which means building a separate graph for each user group that you have; we will see in a minute that this has some problems. The second possibility is to build access control mechanisms into your graph store. The third way we will explore is to do the filtering in your application itself and see how that works out.
So first let's see what happens when you make separate graphs for each user group. I've drawn a few criteria here. Here is the case of one graph for all user groups, and here is the column for the separate graphs.
The first thing that hits you regarding scalability is the number of management processes that you have. Just maintaining a graph for each user group will be very expensive from a management perspective, because you have a lot of processes that can go wrong and you have to keep track of all of them. So that's not nice.
A second important thing is the available cache memory in your graph database. The performance of a graph database is greatly enhanced by cache memory, because often-visited nodes are served from memory. When you have just one graph, you can apply all the cache memory to that single graph. But if you have a graph for each user group, the same often-visited node will be in memory as many times as you have user groups. So dividing the cache memory between graphs will hit your performance.
We also see this with the usage of CPU cycles. Of course, when you do authorization processing, this requires more CPU cycles. But when you have separate graphs per user group, all the graph data has to be fetched from disk and sent over the network separately for each user group, and this also requires CPU cycles. So in both cases the performance, or the required resources, will be hit in some way. Disk and network I have already covered.
Maybe there is one advantage of having separate graphs per user group, and that is the case where your graph database gets corrupted in some way: then just a single user group is affected. And scalability we have talked about. All in all, when we look at this table, we see that building separate graphs per user group is very painful. We wouldn't really like to do that, unless maybe you have just three user groups or so.
For the second direction, let's see whether we can do something inside the graph database itself. For that we have to see how a graph database stores its data.
This picture is taken from the JanusGraph documentation. From a high-level viewpoint the same holds for Neo4j. Because graphs in general are very sparse, you have to store them in an adjacency format. So basically, for each vertex in your graph you have a row in your database, and in this row there are cells that contain the properties of the vertex and its edges, the adjacencies of the vertex. As we have seen before, we can also have access control rules on the edges themselves, so the access control information is required on the edge itself. You already see that you end up with some form of cell-level security in the graph database, and the access control information would have to be stored in the data format somehow. When you look up how JanusGraph or Neo4j stores its data, you see that this is not catered for at this stage. As I see it, it wouldn't be very easy to add, either. At least it wouldn't be a one-week project; it would be a lot of work. So I didn't pursue this way of enhancing the graph database with access control.
Then the third direction I would like to explore, and that is to do the filtering inside the application. What we do then is use the regular data format of the graph database and store authorization properties on the vertices and the edges of the graph. So each vertex and each edge gets an authorization property. In the broadest sense it would just be an array of values, and in this example I took strings because they are human-readable and so easy to understand. Of course, in the graph itself you wouldn't want this list to be very long, because all the values have to be stored. On the user side, on the other hand, you can assign authorizations to the user, and this can be a very long list if necessary, if you have a lot of sources with a lot of authorization levels.
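As an illustration of the authorization property described above, here is a minimal sketch in plain Python. It is not the speaker's code and does not use TinkerPop. The class names, the `authz` field, the store names and the matching rule (an element is visible when it shares at least one authorization value with the user) are all assumptions made for the example. It replays the earlier two-store case, where a user with access to store one only should not reach person four.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    name: str
    authz: frozenset  # authorization values stored on the vertex


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    label: str
    authz: frozenset  # authorization values stored on the edge


def is_visible(element_authz, user_authz):
    # Assumed rule: visible when the element shares at least one value with the user.
    return not element_authz.isdisjoint(user_authz)


def visible_neighbours(vertices, edges, start, user_authz):
    """Names reachable in one out-step, checking both the edge and its target vertex."""
    user_authz = frozenset(user_authz)
    result = []
    for edge in edges:
        if edge.source != start or not is_visible(edge.authz, user_authz):
            continue
        target = vertices[edge.target]
        if is_visible(target.authz, user_authz):
            result.append(target.name)
    return sorted(result)


# Hypothetical data: persons exist in both stores, each edge comes from one store.
both = frozenset({"store1", "store2"})
vertices = {n: Vertex(n, both) for n in ("person1", "person2", "person3", "person4")}
edges = [
    Edge("person1", "person2", "knows", frozenset({"store1"})),
    Edge("person1", "person3", "knows", frozenset({"store1"})),
    Edge("person1", "person4", "knows", frozenset({"store2"})),
]

print(visible_neighbours(vertices, edges, "person1", {"store1"}))
# ['person2', 'person3']
print(visible_neighbours(vertices, edges, "person1", {"store1", "store2"}))
# ['person2', 'person3', 'person4']
```

The short list lives on each graph element and the potentially long list lives with the user, which matches the storage trade-off mentioned above.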
For example, here for the business source we have authorization levels 1, 2 and 3. There are all kinds of ways you could encode your own authorization rules.
This is a high-level view, an architecture, of how you could put this into your application. We assume that at the bottom there is an external graph database where you get the graph data from, that the filtering on the vertices and edges is done in the context of a secure endpoint, which is this API, and that the overall business logic of the graph application sits in an upper tier. So this part here, the filtering and the restriction, filters all the data that comes back from the graph database, and it also restricts the overall API of the graph database. Why that is, we will see in a minute.
If you build up your application like this, you do the filtering and the restriction, so the security part of your application, in one place. You have a proper separation of concerns, and as a result it is less likely that an error in your code sends the wrong data to some user. You only need to get this one part absolutely right. This also makes graph applications easier to audit, because normally this part will see very few changes, while the part where you extend your application and add more graph queries will be much more volatile.
So this approach seems feasible, but of course to check whether it is really feasible you have to do the experiment. That's what I did, and I'm going to show you that in a minute in a demo notebook. Just checking the time, yes.
First a little background. I did this for Apache TinkerPop, and it also applies to JanusGraph. How many of you know Apache TinkerPop or the Gremlin query language? Just a few people. Okay, no problem.
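The layering described above, with the filtering and restriction concentrated in one secure endpoint between the graph database and the business logic, could be sketched as follows. This is an illustrative toy in plain Python under stated assumptions: `GraphStore`, `SecureEndpoint`, `vertex_names` and the sample data are hypothetical names, and the visibility rule (at least one shared authorization value) is assumed rather than taken from the talk.

```python
class GraphStore:
    """Stand-in for the external graph database (hypothetical, in-memory)."""

    def __init__(self, vertices):
        self._vertices = vertices

    def fetch_vertices(self):
        return list(self._vertices)


class SecureEndpoint:
    """The single layer where filtering and API restriction happen."""

    def __init__(self, store):
        # Only this layer holds the store. Python does not enforce privacy;
        # the underscore marks intent, not a security boundary.
        self._store = store

    def vertices(self, user_authz):
        allowed = frozenset(user_authz)
        if not allowed:
            raise PermissionError("user authorizations are required")
        return [
            {"name": v["name"]}  # hand back copies without the authz property
            for v in self._store.fetch_vertices()
            if allowed & frozenset(v["authz"])
        ]


def vertex_names(endpoint, user_authz):
    """Business logic: it never touches the store, only the endpoint."""
    return sorted(v["name"] for v in endpoint.vertices(user_authz))


store = GraphStore([
    {"name": "customer", "authz": ["bis3", "fin3"]},
    {"name": "product", "authz": ["bis3"]},
    {"name": "location", "authz": ["fin3"]},
])
endpoint = SecureEndpoint(store)
print(vertex_names(endpoint, ["bis3"]))  # ['customer', 'product']
```

With this shape, an audit only has to cover `SecureEndpoint`; new queries are added in the business logic without touching the security code.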
When you look at a graph database in general, there will be some instance of a graph object where your basic API calls start. When you use basic TinkerPop it would be the so-called TinkerGraph, and when you use JanusGraph it would be the StandardJanusGraph object. From this graph instance you can then instantiate a graph traversal source, and when you issue a query, that one instantiates a so-called default graph traversal. There can also be nested queries, which TinkerPop calls anonymous graph traversals. We also see that there are some other types of traversal which implement the same graph traversal interface, and this interface also holds part of the Gremlin language, or DSL. Those four classes and interfaces make up the Gremlin query language; this is the Java Gremlin DSL.
To get access control into this, we just have to slightly extend those four classes and interfaces. Then we get an authorized Java Gremlin DSL on top of the existing TinkerPop APIs. It does two things: it filters the data that comes back from the graph database, and it restricts the API, because the API gives you all kinds of opportunities to circumvent the filtering. While doing this, you see that it is mostly very straightforward. There are only a few cases where it gets tricky and you need to apply so-called stack inspection: you have to see who calls a certain API method, and based on who calls it you block it or you don't. Of course this is very fragile, because now the additional APIs that you make do not just depend on the public APIs but on the actual implementation of these APIs. So in future versions of TinkerPop this might just not work anymore, and you have to repair it for any version upgrade.
So what do the Gremlin queries look like with or without this filtering? We have here a very simple traversal where we look for the friends of Matilde that live in Brussels.
You see that while you traverse the graph with out and in steps, at each step you take in the graph you have to filter the user authorizations against the authorization property of the vertex or the edge. This becomes very tedious, and it is also error-prone, of course. It is also possible to obfuscate the code so that some of the filtering is maybe not visible anymore. When you apply the authorized traversal source, so the extensions that I made, you only have to provide the user authorizations at the start of the traversal, and when instantiating the traversal object you have to provide the proper class to it. After that, the traversal looks like what you would normally have in Gremlin. All the code for this is available from GitHub, and the demo I will move into is also part of this repository.
OK, let's see. I will keep this graph at hand because we will probably need it in the demo. This is what the demo looks like. You see that only a single jar is added to the class path, which adds the classes of the authorized traversal source, and of course I pre-downloaded some jars from the TinkerPop libraries. This is just some code to build the demo graph: you see here some add-vertex calls, sorry, here are some add-edge calls, and here we have the unrestricted traversal source and the authorized traversal source that we will use further on in the demo. And here again is the picture of the graph that this builds.
So, normal operation. We will try here to get all the vertices from the graph and their names when we provide a user authorization of bis3. We can just see in the graph which vertices have the authorization bis3, and it's only P0 and V1. So let's run it, and indeed: on the unrestricted source we get all five vertices, and on the authorized one we only get the ones with the proper authorization.
Of course we can do the same when we start from a single vertex. We start from the vertex with the name P1 and just do one out step. Here we do the example where we provide two authorizations,
fin2 and fin3. We start here with the authorization fin3 provided, so we can go here: fin3. fin2 is also provided, but here bis3 is not provided. So when we run this, indeed, we only get this second vertex.
OK, so this again is the normal operation, the filtering part of the API. Now we come to the blocking part of the API. The normal TinkerPop APIs have a lot of methods that allow you to circumvent this filtering. When we want to have the security part as a single concern in one place, we want to block those methods, so that we ensure that unauthorized data doesn't wander around in our code.
Here is one example. First let's see what happens when we do not provide user authorizations. This is very dull: it just throws an exception saying that you should call "with authorization" first. Of course you could also try to call it twice and add more authorizations the second time. People try things. Here too you get an exception, saying that you can only call it once. So the authorized traversal source keeps track of how many times "with authorization" is called.
What you then have to know is that the user authorizations that you apply to your query are stored inside the traversal, simply as a property, the user authorization. You can even show it with a Gremlin query: cap is a Gremlin step that outputs any side effects that you have stored in your query. You see here that indeed we have provided the bis3 authorization.
So one thing we could try, to circumvent the filtering, is to use some other part of the API somewhere in our code, called withSideEffect, and try to override this user authorization property that we filled with the "with authorization" call. Let's try this. We have seen this query before; it was the first one that we tried. The P0 and V1 result is what we really expect for the bis3 user authorization. So apparently this withSideEffect call doesn't raise an exception, but it is ignored anyway, so it doesn't
hurt us. Let me see how far I am in my time. Still okay.
Here we have another method from the Gremlin language with which to try to manipulate this user authorization value that determines our data filtering: the so-called store step, or alternatively the aggregate step. We just try to inject our own value for the authorization into this user authorization property. Let's run that too. Then we see that this user authorization property inside the traversal is actually an unmodifiable list: the step tries to store the value in it, but that raises an exception. So that's good.
Here is another one that we can try. Let's see what happens here. The TinkerPop APIs also have a few very low-level steps for making traversals, called the map, flatMap and filter steps. To these steps you can provide a function that gets access to the so-called traverser object. Traversers are the objects that really go through the traversal and keep all the data that is necessary during a pass through the graph, so that at the end you can get the right output. These traversers also store this user authorization value, and it turns out that when you get access to the traversers you can really modify it. So all these functions that get access to the traversers are blocked in the authorized API. Indeed, when we try to do it, the API raises an exception saying that the method is not available.
There is another way to circumvent the filtering that we have blocked in the API, and that is getting access to the graph instance itself. There is a very simple step for that, getGraph, as you would expect. Once you have access to the graph you can just get a new traversal source and get all the vertices that you want. So this is also blocked: when you try to do this you again get "method not available".
There is another way to get access to the graph instance, and that is by reflection in the Java language. If you do it this way, as in the example I've made,
you can simply get access to the demo graph instance, and once you have that you can do a normal traversal. When you run it, you see that you just get all the vertices of the graph. So when you really want to be sure that your authorized traversal source works as intended, and that the graph instance is not accessed anyway somewhere else in your code, you have to set up a policy for your JVM security manager so that this is not possible.
I have one final example, about nested queries. Of course you have to check, when you do nested queries, whether the full graph is accessible after all inside the nested query. Here is just one example of how you could do that. Here all the vertices of the graph are mapped again to all the vertices of the graph, and this mapping is done for every single vertex. So here you have to fold the results, the full set of graph vertices, so that you keep a single value, and then here you unfold them and get the values. When we run this, we see again that we get the values we expected. This is the same query that we did at the beginning: just look at the vertices that have the bis3 user authorization. So what we see here is that the nested query inherits the user authorizations of its parent, which is how we want it. Of course, among the classes that I showed at the beginning in the class diagram, the class for the anonymous traversal was changed in the same way as the other classes.
You can do the same with a nested query that has edges as a result, and here too you get the results you expect. We can check that. What do we do here? We start from a vertex with the name v1 and we look at which in-edges it has. We can see it in the figure: here is the v1 vertex, and it has two in-edges, one with the bis3 authorization and one with fin3. We only provide the bis3 authorization, so we only get one in-edge.
OK, that's the end of the demo, so let's wrap things up. I hope I have convinced you that giving proper attention to
access control in graphs can be very useful, even necessary, when you use them in a corporate environment. You also see, as was already said in the introduction, that there is a lot to it and that it is not easy to achieve. We have seen that using separate graphs per user group is a viable option, but it carries severe penalties in terms of the management processes that you need and also in the performance of your graph. We have also seen that cell-level security, so integrated in the graph database itself, seems rather tricky to do; at least it would be a large job. And third, we have seen that you can do it in the application. This is feasible, also regarding performance. The performance of graph databases is often determined by fetching the vertex data from disk and getting it over the network, and some additional filtering in the application is actually not a lot of work, so you do not see it back in your performance. We have tried this with very large graphs of tens of millions of vertices. So doing it in the application is feasible, but it is still fragile, because of the stack inspections and also because the public APIs of the graph databases are so vast and are really not designed for this kind of application. You can only do this in the context of a secure endpoint, so you cannot give end users access to this authorized traversal source.
OK, I hope you enjoyed it.
That was very hard for me to understand. OK, yes. The question is whether, rather than providing four APIs, I could just provide a single API. That would basically be an empty API, because you have to build the four anyway, and once you have those you can wrap them up in one overall API, but that would just be forwarding. So I didn't see that as useful.
It's very hard to hear. So the question was: if you have this framework, is it possible to show a client that they haven't got the full data back, but a restricted set of data, so that they know they are not seeing the full result? Yes, I think you would just have to write test cases for
that. Auditing the code that you have really becomes much easier, because there is only one entrance into the graph data, so you can just Ctrl-F for that.
OK, cool. I have a question. There are also higher-level ways of doing this, where you could say: I put the security constraints on the label, or the relationship type, or something like that. Have you considered that as well, that you pull up the level of granularity to part of the graph metadata structure, to say that these people can't see these labels or these relationship types, and not have to go fine-grained?
I didn't need that for my use case, but of course you see for other big data stores that you have ACLs as a kind of predicate. I have just one predicate, on my user authorization property, but of course you could implement other predicates as well. Yes, sure.
When you tested this on the large multimillion node and relationship graphs, what was the complexity of the security rules that you had? Was it a lot of users with a lot of different ...?
It's just these filter steps, so for every vertex that you get, you get another filter step. At least for the JanusGraph database that I use, performance is mostly limited by how fast you get a vertex row back, and in this vertex row you have all the data, all the edges. Once you have the data, it's a very fast operation to do all the filter steps, so it doesn't hit your performance.
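To make the "one filter step per vertex and edge" answer concrete, here is a small sketch in plain Python. It is a toy model, not TinkerPop or the speaker's library: `AuthorizedSource`, `Traversal`, `with_authorization` and the sample graph are hypothetical names, loosely modelled on the behaviour shown in the demo (authorizations must be supplied exactly once, are stored immutably, and are checked at every step). The visibility rule, at least one shared value, is an assumption.

```python
class AuthorizedSource:
    """Toy stand-in for an authorized traversal source (not TinkerPop code)."""

    def __init__(self, vertices, edges):
        self._vertices = vertices  # name -> set of authorization values
        self._edges = edges        # (source, target, set of authorization values)
        self._user_authz = None

    def with_authorization(self, *values):
        if self._user_authz is not None:
            raise RuntimeError("with_authorization can only be called once")
        if not values:
            raise ValueError("at least one authorization is required")
        self._user_authz = frozenset(values)  # immutable after this call
        return self

    def _require_authz(self):
        if self._user_authz is None:
            raise RuntimeError("call with_authorization first")

    def allowed(self, authz):
        self._require_authz()
        return not self._user_authz.isdisjoint(authz)

    def V(self):
        self._require_authz()
        names = [n for n, authz in self._vertices.items() if self.allowed(authz)]
        return Traversal(self, names)


class Traversal:
    def __init__(self, source, current):
        self._source = source
        self._current = current

    def out(self):
        src = self._source
        reached = [
            target
            for name in self._current
            for (source, target, authz) in src._edges
            if source == name
            and src.allowed(authz)                  # filter step on the edge
            and src.allowed(src._vertices[target])  # filter step on the vertex
        ]
        return Traversal(src, reached)

    def to_list(self):
        return list(self._current)


# Hypothetical three-vertex graph.
vertices = {"a": {"bis3"}, "b": {"bis3", "fin3"}, "c": {"fin3"}}
edges = [("a", "b", {"bis3"}), ("a", "c", {"bis3"}), ("b", "c", {"fin3"})]

g = AuthorizedSource(vertices, edges).with_authorization("bis3")
print(g.V().to_list())        # ['a', 'b']
print(g.V().out().to_list())  # ['b']
```

Each check is a set operation on data that has already been fetched, which is consistent with the observation that the filtering cost is small next to getting the vertex row from the database.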