Hi everyone, my name is Eylon Yogev. This is joint work with Liran Katzir and Clara Shikhelman, and it's about interactive proofs for social graphs.

So social graphs, as you all know and probably use, affect billions of people around the world. They have also become a modern approach to studying society and, in general, human relationships. Many different public companies maintain social graphs, and you can see a lot of examples here in the picture; I also include things like DBLP, search engines, and many other virtual assets that can be modeled as a social network.

It's very important to keep track of the health of a social graph. Companies publish reports with different measures of the health of their networks. We're going to talk about a few measures, but the main one is simply the size of the graph or network: the number of nodes in the network, the number of active users. For example, Facebook acquired WhatsApp for the steep price of 16 billion dollars, which was computed from a valuation of about 40 dollars per user. You can see here, from Facebook's report this year, that they reported over 1.73 billion daily active users and 2.6 billion monthly active users. These are Facebook's own reports.

Of course, companies might have incentives, financial, political, or otherwise, to cheat or lie in these reports, and we can ask: should we trust these reports? At the very least, it seems crucial to have an independent estimate of these measures. There are two main challenges that come with this. One is that the graph is huge, and this is a challenge not only computationally but also because access to the graph is limited: we are not simply handed a file with all the information about the network. Instead, the way we can access these networks is through what's called public, or external, access. This interface usually includes two kinds of queries. The first is a membership query: I'm given a user ID, some ID from a huge universe of possible IDs, and I can check whether this ID exists in the network. Usually, if it exists, I also get some metadata, a profile page of the user containing information such as the user's age. The second kind of query is a neighborhood query, where again I provide the ID of a user and I get back the list of its neighbors in the graph; if this list is too large, you can think of it as paginated, so I provide the ID and an index and get the corresponding neighbor. (A small sketch of this interface appears below.)

There are many relevant prior works, but just a very brief history: there are works that use this public interface to get an independent estimate of the size of the graph, and of many other measures. For example, Ye et al. and Katzir et al. use this public interface with roughly n^alpha queries, for some small constant alpha, say between one quarter and one half, and with this they were able to estimate the size of the graph. Our main question is: can we get such an independent estimate of the size of a social graph, or of other measures, using only a few queries to the network, so not poly(n) but more like polylog(n), while still placing no trust in the graph's provider? To answer this, we introduce interactive proofs for social graphs, in which we're going to communicate with a prover.
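To make this interface concrete, here is a minimal sketch in Python of what the public-access oracle described above could look like. The class and method names are purely hypothetical illustrations, not any network's actual API.

```python
from typing import Dict, List, Optional

class SocialGraphOracle:
    """Hypothetical wrapper around the public-access interface of a network."""

    def __init__(self, adjacency: Dict[str, List[str]], profiles: Dict[str, dict]):
        self._adj = adjacency      # user id -> list of neighbor ids
        self._profiles = profiles  # user id -> profile metadata (age, etc.)

    def membership(self, user_id: str) -> Optional[dict]:
        """Membership query: return the user's profile metadata if the id
        exists in the network, and None otherwise."""
        return self._profiles.get(user_id)

    def neighbors(self, user_id: str) -> List[str]:
        """Neighborhood query: return the list of neighbor ids of user_id
        (empty if the id does not exist)."""
        return self._adj.get(user_id, [])
```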
Okay, and we're going to use this oracle access to the graph, but again, we're not going to trust the prover. We're going to use his help and computational power, but we're not going to trust him. The model is as follows. We have a verifier, and the verifier can interact with a prover. The instance they are talking about is this graph G. The verifier has only oracle access to G, and again this oracle is defined by the two queries we talked about: membership, possibly with metadata, and neighborhood. The prover, of course, is all-powerful and has full access to the graph and everything. So this is more like an interactive proof of proximity, because the verifier cannot even read the whole instance; it's not only about not having a witness.

Let me raise some immediate criticism of our model. One thing we are assuming is that the oracle answers are returned truthfully, and you can ask why. If you don't trust Facebook's reports, then when you query the graph, the same people who gave you those reports are the ones implementing this oracle, so why do we trust the oracle answers? There are a couple of reasons. First, it is very hard to distinguish between legitimate user queries and the verifier's queries, so it would be hard for Facebook to cheat in these answers only for us and not for other users. This actually forces them to materialize their cheating in the network itself: if they want to cheat and claim that they have millions of users they don't have, they actually need to create these users and make them consistent in the network, and then you could possibly catch them later.

Okay, so what is our main result? Our main theorem is the following. You give me some graph G: it's an n-vertex graph, it has mixing time tau (I'm going to define mixing time later, but for now it's more or less how many steps you need to take in the graph to get close to the stationary distribution), and it has average degree delta. The result says that there is a doubly efficient interactive proof in the social-graph model, the model I just presented, for estimating the size of the graph, with the following properties. First, there is an approximation error epsilon: you can plug in whatever epsilon you want, this is the error you allow. The prover claims that the graph is of size n-tilde, and we verify that n-tilde is very close to n: at most (1 + epsilon) times n and at least (1 - epsilon) times n. The protocol has two messages and is public coin, which is great. The query complexity of the verifier, maybe the most important parameter here, is small: it is one over epsilon squared, times the mixing time tau, times the average degree delta. It seems that this is inherent, but that's not completely clear. And, as we said, it's doubly efficient: the prover's running time is also efficient, n times one over epsilon squared, which for, say, constant epsilon is quasi-linear.

There are several applications of this main result; I'll just talk briefly about a few. First, you can estimate not only the size of the graph itself, but of any subgraph. For example, if the metadata that you get for a user includes the user's age, then you can actually check how many users are between the ages of 10 and 20, just by treating that as a subgraph.
Other health measures that you can get as an application of the main theorem are the degree distribution, the median degree, what's called the local clustering coefficient, and many other measures. In general, for any function f, you can estimate a quantile, say the median or another quantile, of the values of f applied to all the nodes of the graph. One last application is that we can use the Fiat-Shamir transform. This is a transformation that, in the random oracle model, compiles these interactive proofs into non-interactive arguments, so the soundness is now computational. These arguments can be published once and then be publicly verified by any single user using a small number of queries to the graph. So if you go back to this report by Facebook, after claiming that they have this many billion daily active users, they can just add a small proof here, which could later be verified by any user. Hopefully this will be part of the next generation of regulation, something like GDPR, but for social graphs.

Okay, our protocol. We're going to present the protocol by first showing the same protocol for general sets, and then we're going to see how it applies to social graphs. So we are given a set S and some number n-tilde, where the prover claims that the set S is of size n-tilde. We have membership access to the set, so for any element we can check whether it's in the set, and we have a sampler D: an algorithm that, when we run it, returns a uniform sample from the set. These are the two access queries we have to the set, and later we're going to see how to implement them in a social graph. Our goal is to verify the prover's claim that n-tilde is very close to the size of the actual set, so both a lower bound and an upper bound; we want to distinguish between these two cases even when there is only a small epsilon, say a one percent, difference between what the prover claimed and the actual size. Our inspiration for this, if you know it, is the protocol of Goldwasser and Sipser, which provided, as part of the proof, a lower bound on the size of a set.

So let's see what we do. The verifier sends the prover a hash function h and an element y. The hash function h maps the universe to the set of numbers between 1 and n times epsilon squared, so something smaller than n, and y is uniform in this range. For now, assume this hash function is truly random; later we're going to see that we can actually rely on polylog-wise independence, but until the end, let's just assume these hash functions are random. The prover replies with the set Z of all the preimages of y, the preimages of this random element in the range, but of course only the preimages that are in the set. The verifier checks that this set of preimages is indeed contained in the set and that its elements are actually preimages of y, just by applying the hash function to all the members of Z. And it accepts if and only if the size of Z is, more or less (there are some parameters here), one over epsilon squared. If h is random and the set is indeed of size n, then we expect about one over epsilon squared elements to fall on this specific y, and the verifier accepts if the count is close to that.
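As a minimal sketch of this lower-bound step, assuming a truly random hash function, here is roughly what the verifier does; the helper names and the exact acceptance cutoff are illustrative simplifications, not the paper's precise parameters.

```python
import random

def lower_bound_check(n_tilde, eps, membership, h, ask_prover):
    """Verifier side of the lower-bound step (Goldwasser-Sipser style).
    membership(x): the set-membership oracle (None if x is not in S).
    h: hash function from the universe into {0, ..., m-1}, m ~ n_tilde * eps^2.
    ask_prover(h, y): stands for the prover's reply, i.e. all claimed
    preimages of y that lie in the set S."""
    m = max(1, int(n_tilde * eps ** 2))
    y = random.randrange(m)          # verifier: uniform element of the range
    Z = ask_prover(h, y)             # prover: preimages of y inside the set

    # Verify each claimed preimage with one membership query plus a hash check.
    for x in Z:
        if membership(x) is None or h(x) % m != y:
            return False
    # If |S| is really about n_tilde, about 1/eps^2 elements should hash to y.
    expected = 1.0 / eps ** 2
    return abs(len(Z) - expected) <= 0.5 * expected   # simplified cutoff
```

This uses about 1/eps^2 membership queries, matching the verifier query complexity mentioned above.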
On the other hand, if the set were very small, then you would not have so many elements landing on this specific y. The complexity of this protocol is actually one over epsilon squared: just the membership queries needed to make sure that all these elements are in S. The main difference from Goldwasser-Sipser, which is very similar to this, is that they relied on a gap of a factor of 2 between completeness and soundness, so they could distinguish between a set of size n and one of size 2n. Here we need to change this a bit so that we notice a small factor of epsilon. So we generalize the protocol, with a somewhat more technical analysis, but the heart really lies in Goldwasser-Sipser.

But this was a lower bound, and to complete the picture we also need an upper bound. The upper bound goes as follows. Our first attempt does something similar to what the previous works did. The verifier uses the sampler algorithm D to sample the square root of n uniform elements (really the square root of n-tilde, the claimed size), and it accepts if and only if there exist two distinct indices whose samples are equal. If the set is indeed of size n, then we're going to see a collision with some probability, by the birthday paradox. And if the set is actually much larger, so the prover claims the set is very small but the actual set is much larger, then we're not going to see a collision: there won't exist two distinct indices that map to the same element. This gives us some probability of distinguishing between completeness and soundness, and we're going to do about one over epsilon repetitions and take the median to amplify the probabilities. This is great in terms of completeness and soundness, but the query complexity is not so good: the verifier actually needs to make square root of n queries, and the whole idea was to eliminate that need. (A sketch of this first attempt appears below.)

The solution is to delegate all of this work to the prover, but again, in such a way that we can trust what he's doing. One thing we cannot do is let the prover claim that he sampled square root of n elements on his own; he's not going to do that honestly. Instead, we again sample a hash function h (again, assume it is truly random), and this function h is going to represent a huge amount of randomness that defines how the elements are sampled. The prover cannot sample randomness himself; he has to use h. What do I mean? h is going to map the indices 1 through square root of n to {0,1}^L, where L is the number of random bits needed for the algorithm D to sample one element. So h(i) just defines enough randomness to run D once, that's all. Then what is the prover going to do, given h? He's going to compute v_i, which is just the result of running the algorithm D, the one that samples a uniform element, using randomness h(i). So h(i) defines the randomness for the i-th sample, and from it we get the element v_i. All of this work is performed by the prover. At the end he sends us two indices, i and j, the minimal pair of indices such that v_i equals v_j. What does the verifier have to do? He only gets i and j. He computes the randomness h(i) and h(j), so now he has the randomness.
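As a quick aside before finishing the delegation argument, here is a small sketch of the naive first attempt from above; `sampler` stands for the uniform-sampling oracle D, and the code is an illustration of the birthday-paradox test, not the paper's exact procedure.

```python
import math

def naive_collision_test(n_tilde, sampler):
    """Draw about sqrt(n_tilde) uniform samples and accept iff two of them
    collide (birthday paradox). If the real set is much larger than n_tilde,
    a collision is unlikely. This costs ~sqrt(n_tilde) sampler calls, which
    is exactly the cost the delegation step removes."""
    k = math.isqrt(n_tilde) + 1
    seen = set()
    for _ in range(k):
        v = sampler()
        if v in seen:
            return True    # collision: consistent with |S| being ~ n_tilde or smaller
        seen.add(v)
    return False           # no collision: evidence that |S| is much larger than claimed
```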
He runs the algorithm D using this randomness and checks that the two samples indeed collide, and of course that i is different from j and that both i and j are between 1 and the square root of n. If the set is indeed of size n, then such i and j will exist; and if the set is larger, such i and j will not exist. Again, this only happens with some probability, so we need to amplify. But the main thing here is that the query complexity is now only about one over epsilon. This really assumed that h is random: we needed h to provide L random bits per index, and it had to be very random, and we're going to see how to instantiate it. But the main idea is here.

Before I talk about h, let me just say how we go from sets back to social graphs. The set S is, of course, just the set of nodes in the graph; that's what we want, to estimate the size of the graph. Membership queries we simply assume we have, directly via the public interface of the network. Sampling queries we did not assume we have; however, we can implement them. We can implement the algorithm D, which samples uniform elements of the set, using random walks: performing random walks in the graph gives us random nodes of the graph. How is this performed? We start at an arbitrary vertex. We choose a uniformly random neighbor v, so we get the list of neighbors and choose one at random, we move to v, and then we repeat the whole process from v. This is how we do a random walk in the graph. This process converges to the stationary distribution, where the probability of ending up at some specific node v after enough steps is the degree of v over n times delta, where again delta is the average degree of the graph.

So what is the mixing time? A graph G has mixing time tau if after tau iterations you are close, say in statistical distance, to the stationary distribution; so after tau steps I'm at, say, some constant, say one quarter, distance from the stationary distribution. The constants do not really matter: you can take a few additional steps and you get essentially as close to the stationary distribution as you like. Finally, we want a uniform element, whereas here we get an element with probability proportional to its degree. So we do rejection sampling: when you get an element, if it has higher degree, you reject it with higher probability and start everything from scratch, and you repeat until you don't reject. This gives you a uniform element at the end, and in expectation you perform this walk of length tau about delta times, so after about tau times delta steps we can implement the uniform sampling algorithm. (A sketch of this sampler appears below.)

Finally, this takes me back to the hash function. We want to implement this hash function, which we assumed to be totally random. Before showing how, let me just say: if you're anyway going to apply the Fiat-Shamir transformation, so you anyway rely on the random oracle model and some heuristic to implement it, such as SHA-256, then just use that same hash function also as the hash function in the protocol. This will be very efficient and work well.
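Going back to the sampler D, here is a minimal sketch of the random-walk implementation with rejection sampling described above. It assumes the hypothetical oracle interface sketched earlier, a connected graph with no isolated vertices, and a known (or estimated) mixing time tau; it is an illustration of the idea rather than the paper's exact procedure.

```python
import random

def random_walk(oracle, start, tau):
    """Walk tau steps, moving to a uniformly random neighbor at each step.
    After ~tau steps the endpoint is roughly stationary, i.e. node v is
    reached with probability about deg(v) / (n * avg_deg)."""
    v = start
    for _ in range(tau):
        v = random.choice(oracle.neighbors(v))
    return v

def uniform_node_sample(oracle, start, tau):
    """Rejection sampling on top of the walk: accept the endpoint v with
    probability 1/deg(v), which cancels the degree bias of the stationary
    distribution, so accepted samples are (close to) uniform over the nodes.
    In expectation this repeats about avg_deg times, i.e. ~tau * avg_deg steps."""
    while True:
        v = random_walk(oracle, start, tau)
        if random.random() < 1.0 / len(oracle.neighbors(v)):
            return v
        # otherwise reject and start a fresh walk
```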
However, if you do want a theoretical result showing that what we have is actually an interactive proof, then we show that we can rely on polylog-wise independent hash families. The proof goes as follows. We observe that the work of the verifier and the prover together can be described by an AC0 circuit. For the verifier this is easy to see; for the prover it takes a bit more work. In particular, one of the heavy steps is these random walks, which are not very shallow. However, because this is a circuit, you can actually pre-compute all the random walks, and then walking along them is very quick, because they are pre-computed. The circuit becomes much larger, actually exponential in the mixing time tau, but we only depend logarithmically on the circuit size, so this is fine. Then we use a theorem of Braverman that says that polylog-wise independence fools AC0 circuits. Putting these two together, you can replace the random hash function with a polylog-wise independent hash family, and the probability that there exists a proof the verifier accepts changes by at most the distinguishing advantage against the AC0 circuit. So we lose a small factor in completeness and soundness, but again, you can repeat the protocol to get the completeness and soundness that you want.

Okay, just to summarize, and some open problems. We introduced a new notion of interactive proofs for social graphs, and we provided protocols for monitoring the health of social graphs in this model. One open problem is to eliminate the dependency on delta: I think the dependency on tau is inherent, but the dependency on delta, the average degree, I'm not sure is inherent. Can you use our protocol to monitor other, more important health measures? And finally, can we make this part of a standard, part of how we regulate social companies? Thank you very much.