This is joint work with Agnishon, who worked on this as an undergraduate student at CMI; he is now a PhD student at Rice University. Let me begin by broadly describing what a watermark is. Suppose you have some kind of digital object. In the most general terms, a watermark is a modification that is difficult to undo for someone who does not have the original. The motivation is that you do not want other people to have the original: you want to be able to prove that only you have the original copy and that everyone else has a duplicate. Another sensible condition is that the modification should not deviate too much from the original copy; if you modify it so completely that it no longer looks like the original, there is no point. So you should ensure the deviation is not too large. Now, how do you measure how much is too much? That depends on the kind of digital object you want to watermark.

For example, consider the work that studied how to watermark weighted graphs. The motivation there comes from the need to watermark maps that tell you how to get from one place to another. The underlying model is a weighted graph, with weights on the edges, and they studied how to slightly modify such graphs so that for any pair of vertices, the shortest distance between them does not differ too much between the original and the watermarked copy. Here "too much" is measured simply: you have two numbers, and you require that their difference be within a constant. That is one example of how to measure how much is too much.

This idea has been adapted to relational databases, and there the goal is to ensure that the results of structured queries do not differ too much between the two copies. By structured queries I mean the structured query language SQL, which in other words is just first-order logic. Let us look at a simple example: a database of an organization with the salaries of its employees. Suppose you have this list of employees, the cities they live in, and their salaries, and you want to watermark this database. One way to do it is to slightly change the salaries: you can change John's salary by adding one, and Pooja's salary by subtracting one.

There are a few properties of this modification that I want to highlight. One is that the local distortion is bounded, in the sense that if you take corresponding tuples from the two databases, the difference is bounded by a constant. I call this local distortion because you compare corresponding tuples and require the difference to be bounded. But we want to satisfy one more goal. Suppose we want to run a query on this database: a SQL query that asks for the list of all employees and their salaries such that the employee works in a particular city, where the city is given as an input to the query. If you say the city is Chennai, it will return John 10,000 and Pooja 15,000. This can equally well be written as a first-order logic formula: in the usual notation, psi is a first-order formula and these are its free variables, among which we want to distinguish input variables from output variables.
I put city as the first component just to highlight that city is going to be an input variable, and what is output is the pair (name, salary). The formula itself is a standard first-order formula saying that this triple should be in the relation EmployeeTable. These two views are just to emphasize that SQL queries are first-order logic formulas.

Another goal I want my watermark to satisfy is that, irrespective of which city I give as input to this query, I want a certain guarantee on the output. You can run this query on both instances. On the first instance you get John 10,000 and Pooja 15,000; on the second you get John with one value and Pooja with another. If you sum up all the distortions in the output, in this particular instance they add up to 0. That is what I call global distortion. If you run a query, you might get many tuples as output. Since the scheme is locally bounded, we know that each tuple by itself does not deviate too much from the original, but a huge list of tuples, each locally bounded, might still add up to a large total distortion. The global distortion bound is there to avoid that: for a given query, I want the global distortion to be bounded. These are the goals of my watermark; they are not the only goals, and I will come to the others later.

Let me give a semi-formal definition of watermarking databases. As input, you are given a database schema, which is just a list of relations and their arities, and you are given a query. The goal is to come up with a scheme to distort instances of this schema, and the scheme should satisfy certain goals. As we have seen, we want the local distortion and the global distortion to be bounded. But another goal is that for larger and larger instances of the database, you should be able to come up with more and more distortions.

Let me give a slightly more practical motivation for this. Suppose you have a database like this with confidential information about employees. Salaries of employees are actually very valuable data, because other companies want to hire good employees and want to know how much their competitors pay. Salary data is quite marketable, and if you have spent energy and effort collating it, you may want to sell it. But suppose somebody else steals your database and tries to make money out of it; you want to prevent that. You want to go to them and say: you have stolen my database; give me your copy and I will prove that it is a duplicate of mine. Only you have the original, and only you know what distortions you made, so if you get access to their database you have a technical tool to prove that what they have is a duplicate of yours. The problem is that if somebody has stolen your database and is trying to make money out of it, it is very unlikely that they will give you access to their copy. So what you have to do is go to them pretending to be a normal customer and simply ask them to run this query for you, in such a way that they cannot distinguish between you and someone who is trying to prove that their copy is a duplicate.
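To make the two distortion bounds concrete, here is a minimal sketch in Python. The table, the query, and the plus-one/minus-one changes are just the hypothetical John/Pooja example from above, not an actual watermarking scheme.

```python
# Toy illustration of local vs. global distortion for the employee example.
# The table, the query, and the +1/-1 changes are hypothetical: they mirror
# the John/Pooja example above, not an actual watermarking scheme.

original = [("John", "Chennai", 10000), ("Pooja", "Chennai", 15000)]
watermarked = [("John", "Chennai", 10001), ("Pooja", "Chennai", 14999)]

def employees_in_city(instance, city):
    """SELECT name, salary FROM EmployeeTable WHERE city = :city"""
    return [(name, salary) for (name, c, salary) in instance if c == city]

# Local distortion: compare corresponding tuples; each difference is bounded.
local = max(abs(o[2] - w[2]) for o, w in zip(original, watermarked))
assert local <= 1

# Global distortion: sum the differences over the output of one query run.
out_orig = employees_in_city(original, "Chennai")
out_wm = employees_in_city(watermarked, "Chennai")
global_distortion = sum(w[1] - o[1] for o, w in zip(out_orig, out_wm))
assert global_distortion == 0   # the +1 and -1 cancel out
print(local, global_distortion)
```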
So you just go and pretend to be a normal customer and run a query that you are anyway allowed to run, and just by running that query you should be able to prove that what they have is a duplicate. In this example you can do exactly that: all you have to do is run the same query, selecting the name and salary from the table. By comparing the output with the output given by your original copy, you can prove that theirs is a duplicate. That is the motivation for trying to design such watermarking schemes.

Now, what is the motivation for demanding that larger and larger databases admit more and more ways of distortion? The motivation is slightly technical. You want these watermarking schemes to fit into a scenario where there are adversaries who know that you have watermarked the data and may try to erase the watermark, for instance using some randomized algorithm that guesses which figures were distorted and tries to restore them. If you have a large number of possible distortions to begin with, you have more room to fight the adversary, and there are meta-theorems which say that if your watermarking scheme is scalable in this sense, then adversaries cannot do much. Under the assumption that your adversary does not have superhuman knowledge, a reasonable adversary who tries to delete the watermark can be overcome if you start from a scalable watermarking scheme. That is the motivation for demanding scalability.

What has been done so far? Well, it is quite easy to construct database instances where even trivial queries cannot be watermarked. So there has been work identifying sufficient conditions for the existence of such scalable watermarking schemes. Some of these conditions are based on what is called the Gaifman graph of a database. The Gaifman graph is simply the graph whose vertices are the elements of the domain of the database, with an edge between two vertices whenever those two elements participate together in some tuple of a relation. In our example, the vertices are all the names, all the cities, and all the salaries; there is an edge between John and Chennai, an edge between Chennai and 10,000, and an edge between 10,000 and John. Each triple thus gives rise to a triangle in the graph. That is the Gaifman graph.

Now, what are the sufficient conditions? One is that if your Gaifman graph has bounded degree, then FO queries can be preserved and you can get a scalable watermarking scheme. The same paper gave another result: if your Gaifman graph is similar to a tree, where similarity is measured by something called the treewidth of the graph (I will not give the formal definition; it is not essential right now), then MSO queries can be preserved. More important than these two results, though, is an observation the authors made: the techniques used for proving bounded VC dimension also work here.
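As a small aside, here is a sketch of the Gaifman graph construction just described: vertices are the domain elements, and every tuple induces a clique (here, a triangle) on its elements. The instance is the hypothetical employee table from the example.

```python
from itertools import combinations

# Sketch of the Gaifman graph of a relational instance: vertices are the
# domain elements, and two elements are adjacent whenever they occur together
# in some tuple. The instance is the hypothetical employee table from above;
# in general you would iterate over all relations of the database.
employee_table = [("John", "Chennai", 10000), ("Pooja", "Chennai", 15000)]

vertices = set()
edges = set()
for tup in employee_table:
    vertices.update(tup)
    # Every pair of elements in the same tuple gets an edge, so each
    # triple of the relation becomes a triangle in the Gaifman graph.
    for a, b in combinations(tup, 2):
        edges.add(frozenset((a, b)))

print(sorted(map(str, vertices)))
print(edges)
```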
The Vapnik-Chervonenkis (VC) dimension mentioned in that observation is an important concept in learnability theory. For reasons we will see later, the techniques used for proving bounded VC dimension also work for proving that you can get scalable watermarking schemes. That is the important observation made in that paper, and our work started by looking at it. If the techniques used for VC dimension work here, then note that FO queries are already known to have bounded VC dimension for a much larger class of graphs than bounded-degree graphs: it is known that FO queries have bounded VC dimension on graphs of locally bounded treewidth. Locally bounded treewidth is a generalization of treewidth. Treewidth measures how similar a graph is to a tree; locally bounded treewidth is a localization of that concept. A graph class has locally bounded treewidth if, for any vertex of the graph, the subgraph induced by the sphere of radius r around that vertex has treewidth bounded by some function of r. For example, in any planar graph, any sphere of radius r induces a subgraph of treewidth at most 3r. These are examples of graphs with locally bounded treewidth, and it is already known that FO queries have bounded VC dimension on such graphs.

So our work started by observing that if the techniques used for VC dimension work here, then we should also be able to prove that for databases with locally bounded treewidth one can design watermarking schemes that preserve FO queries. Note that you cannot have large cliques in a planar graph, for example; nowhere denseness is an even weaker condition than this, and I will mention it on the final slide. So this is our result: if you take a database instance whose Gaifman graph has locally bounded treewidth, then FO queries can be preserved while watermarking.

Let us understand why bounded VC dimension techniques work for watermarking schemes. Explaining the full argument would take more time than we have, but if you dig to the bottom of both proofs, what you find is that in both scenarios what you really need is to identify pairs of tuples that cannot be distinguished by your FO query. What do I mean by that? Suppose you have two tuples and an FO query; the tuples cannot be distinguished by the query if either both of them are in the output or neither of them is. Such indistinguishable tuples turn out to be important both for bounded VC dimension and for watermarking schemes. That is the basic reason why bounded VC dimension techniques work for watermarking, and that is how we prove our result.

How does the bounded VC dimension argument work in the simpler case? Suppose you have a graph of bounded treewidth; how do you identify pairs that cannot be distinguished by MSO queries? This is a simple application of Courcelle's theorem and the pigeonhole principle. Courcelle's theorem says that on a graph of bounded treewidth, MSO queries can be evaluated by tree automata. Let me try to draw a figure.
Suppose you have a graph that looks very much like a tree, and you have an MSO query, which is now actually a tree automaton, and you want to identify, say, two vertices that cannot be distinguished by your query, in our case by the tree automaton. What you do is identify a subtree that has at least |Q| + 1 nodes, where Q is the set of states of your tree automaton. This is a labelled tree. How do you check whether a given node is part of the output? You put a special label on that node and check whether the run of your tree automaton ends up in a final state. How do you check whether some other node is also part of the output? You put the label on that other node instead and check whether the run again ends in a final state. Now, there are |Q| + 1 nodes here, so you can place the label in |Q| + 1 positions, and by the pigeonhole principle at least two of them lead to the same state. Those two nodes are indistinguishable by the query, because whether you put the label on one or the other, the run ends up in the same state, so either both are accepted or both are rejected.

This is the technique identified in the bounded VC dimension work, and exactly the same technique works for finding watermarkable pairs, because, if you remember, that is precisely what we needed for watermarking: we want to identify two tuples such that, irrespective of which city you give as input, either both are in the output or neither is. Then you can add one to one of them and subtract one from the other, and the global distortion stays bounded. That is the basic idea: identify output tuples that cannot be distinguished, and then you can come up with a scheme.

Now, how do we extend this to locally bounded treewidth? We may have a database whose treewidth is not bounded, but we are guaranteed that any sphere around a fixed vertex has bounded treewidth. Here we take help from a classic idea: if you have locally bounded treewidth, you can also localize your queries, and Gaifman's locality theorem already tells us that first-order queries are local. In more detail, suppose you have a formula psi with a free variable x, and you want to check whether a graph G satisfies psi(v) when you assign a vertex v to the free variable x. This is for a given query; the same scheme may not work for a different query. So you want to check whether the graph G satisfies psi(v). Gaifman's locality theorem tells us that the answer depends only on a fixed number of spheres of fixed radius, where both the number of spheres and their radius depend only on the formula, not on the graph. The observation now is that you have a fixed number of spheres, each of fixed radius; by the assumption that your graphs have locally bounded treewidth, each of these spheres from the inside looks like a tree, so on those spheres you can run your tree automata. That is how the classic idea works, but in our setting it cannot be applied directly, because there it suffices to check whether a certain formula is satisfied or not, whereas what we want is to come up with a huge number of pairs that cannot be distinguished by the query.
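Here is a minimal sketch of the pigeonhole step just described, on a toy tree automaton. The tree, the labels, and the transition function are purely hypothetical; the point is only that among |Q| + 1 candidate nodes, at least two must drive the automaton to the same root state and are therefore indistinguishable by the query.

```python
def make_chain(n):
    """Build a chain of n nodes; a node is (label, list_of_children)."""
    tree = ("a", [])
    for _ in range(n - 1):
        tree = ("a", [tree])
    return tree

def run(node, marked_path, path=()):
    """Bottom-up run; the node at `marked_path` carries the special label."""
    label, children = node
    if path == marked_path:
        label = "marked"
    child_states = [run(child, marked_path, path + (i,))
                    for i, child in enumerate(children)]
    state = child_states[0] if child_states else "q0"
    # Toy transition: seeing the marked label flips the state.
    return "q1" if (label == "marked") != (state == "q1") else "q0"

# The toy automaton has |Q| = 2 states, so among |Q| + 1 = 3 candidate nodes
# at least two must reach the same root state, by the pigeonhole principle.
tree = make_chain(3)
candidates = [(), (0,), (0, 0)]            # paths to the three nodes
by_state = {}
for p in candidates:
    by_state.setdefault(run(tree, p), []).append(p)
pair = next(group for group in by_state.values() if len(group) >= 2)
print("indistinguishable nodes:", pair)    # give one +1 and the other -1
```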
For that we have to do a little more work, and that little more work is the main contribution of this paper. I will try to explain it in the remaining three minutes with a very small figure. This is again a classic idea that has been applied to graphs of locally bounded treewidth, but here we need one more assumption, closure under minors, and I will explain why. You start with an arbitrary vertex v of your graph and split the graph into layers, where each layer consists of the vertices at a fixed distance from the original vertex: L1 is the set of vertices at distance 1 from v, L2 the set of vertices at distance 2, and so on. Then you take some layer Li and leave a buffer of theta layers on both sides, where theta is the radius of the spheres you want to reserve, because you want to run an FO query on these vertices, and the theta buffer layers ensure that you do not get overlapping spheres. Now take the union of these 2 theta + 1 layers. If your graph has locally bounded treewidth and comes from a class of graphs that is closed under minors, it can be proved that the union of those 2 theta + 1 layers induces a subgraph of bounded treewidth. So you run your tree automaton on that union of 2 theta + 1 layers, but you take pairs only from the middle layer Li. I do not take vertices from the other layers, because if I did, some sphere might spill over into another layer and the spheres would start interfering with each other. To avoid that interference, I take pairs only from this layer, then leave a gap of 2 theta layers, take pairs from the next chosen layer, and so on (see the small sketch below). So if you have, say, n layers, then at least n / (2 theta + 1) of them contribute pairs. That is how, from a huge database, you collect a huge number of pairs that are indistinguishable by your query. There are a lot of details that I have brushed aside because there is not enough time, so let me conclude.

One important weakness of this work is that what we have done so far only works for unary queries: I said I select pairs of vertices, so implicitly the output of the query is a single vertex. What if you have a binary query? Then the output is actually a set of pairs, and this technique does not work, because there might be an output pair with one vertex here and the other somewhere else in the graph. The Gaifman graph does not capture enough information in this case: there may be an output tuple whose members are not joined by an edge in the Gaifman graph, so the Gaifman graph is not going to help. This is a bit puzzling, because bounded VC dimension does work there: for bounded VC dimension, all you have to do is prove that unary queries have bounded VC dimension, and then there is a huge hammer from classical model theory which says that if unary queries have bounded VC dimension, then all queries have bounded VC dimension. That is a very, very powerful result in model theory. We broke our heads over this for some time but were not able to replicate it for watermarking; we hope it can still be done. Another question is: do we really need this assumption of closure under minors?
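Before turning to that question, here is a minimal sketch of the layer-selection step just described. The graph, the starting vertex, and theta below are hypothetical; in the actual construction theta is determined by the formula via Gaifman's locality theorem, and the tree automaton would then be run on the union of each block of 2 theta + 1 layers, which is omitted here.

```python
from collections import deque

def bfs_layers(adj, v):
    """Return layers[i] = list of vertices at distance i from v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    layers = [[] for _ in range(max(dist.values()) + 1)]
    for u, d in dist.items():
        layers[d].append(u)
    return layers

def candidate_layers(layers, theta):
    """Keep layers theta, theta + (2*theta + 1), ...; pairs are drawn only
    from these, so their radius-theta spheres cannot overlap across layers.
    Roughly n / (2*theta + 1) of the n layers are kept."""
    return layers[theta::2 * theta + 1]

# Toy example: a path 0 - 1 - ... - 9, rooted at vertex 0, with theta = 1.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
print(candidate_layers(bfs_layers(adj, 0), theta=1))   # [[1], [4], [7]]
```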
In this particular technique we need the assumption of closure under minors because we want to conclude that the union of those 2 theta + 1 layers has bounded treewidth; without closure under minors, that cannot be proved. But for bounded VC dimension they do not use that assumption, only locally bounded treewidth. So another question is: can we drop the assumption of closure under minors? Yet another question is: what about conditions weaker than locally bounded treewidth? As Ramnatham mentioned, if you look at algorithm design, for example checking whether a given graph satisfies an FO sentence, the story started very much like this. It first started with the result that for bounded-degree graphs you can do it in fixed-parameter tractable (FPT) time. Then somebody proved that with locally bounded treewidth and closure under minors you get FPT algorithms for FO model checking. Then somebody proved that even without closure under minors, locally bounded treewidth alone gives FPT. Then there is a long series: if you exclude a minor you get FPT; if you locally exclude a minor you get FPT; then, I think, bounded expansion, then locally bounded expansion. This long story culminated in the class of graphs called nowhere dense: it has been shown that on nowhere dense classes of graphs you have FPT algorithms for FO model checking. And this is, in a sense, an if and only if: if FO model checking is FPT, then under some complexity-theoretic assumption the class is nowhere dense. Can we have the same story here for watermarking? And again, all of these are just sufficient conditions; can they also be converted into necessary conditions? I will stop here.

A question from the audience about the motivation: if I am giving my database to someone for just one query, I might as well give them the output of that query; why do they need the database? Well, the query can be run with different input parameters, right? If you have a huge organization with branches in multiple cities... okay, the database can change too, and maybe this company example with the city as an input parameter is not the best one. But go back to the original example of weighted graphs: you have a huge number of tourist spots in some cities, and you have put a lot of effort into computing the shortest paths. Even if you give somebody the shortest distance between just one particular pair of places, that is still not as good as giving the whole database, because the whole database contains much more information than one shortest distance: it has the shortest distances between all the pairs. So in general the idea is that there is a huge database and a query that can be run with a huge number of input parameters; giving one output for one particular input parameter is fine, but we do not want to give away the underlying data.

Another question: do you actually compute the watermarking scheme in the end, or is it just an existence result? Computing it is not very difficult; the only thing you have to do is identify the pairs that cannot be distinguished.
Once you have identified such pairs, actually computing the watermark is not hard: say John and Pooja are indistinguishable; then you get one distortion by saying plus one, minus one, and another distortion by saying minus one, plus one. The computation itself is straightforward.

Another question: instead of numbers, if the values were strings, what would you do? You would have to fix some linear order and some measure of how far you are allowed to move within that order; that is the notion of local distortion, which you have to figure out based on the application of the database.

Another question: why does it not suffice to consider closure under subgraphs? Going back to the theta layers: you are considering the induced subgraph on those layers, and in the new graph the distance between the nodes can still be large, because you are removing the things in between the buffer layers. So your question is why we need closure under minors. The way you prove that a constant number of layers has bounded treewidth is like this: suppose your v is here, this is the theta layer and this is the 2 theta layer; the induced subgraph is this, and what you want to prove is that this disc has bounded treewidth. You are taking the vertices in Li? No, Li is somewhere in the middle; you take theta layers before Li and theta layers after. This is just a different way of writing the same thing. Maybe I should talk to you about this offline. So I suggest that we conclude; thank you again. Thank you very much.