Okay, so the topic of my talk is identifying central individuals in a network, and it's focused on understanding diffusion in networks. In particular, we're going to talk about gossip today, so the role of gossip in helping identify prominent individuals, and I'll explain that. This is joint work with Abhijit Banerjee, Arun Chandrasekhar and Esther Duflo, and it's part of a long project we've been doing in southern India. This is a new piece of it, and I'll explain what's new about it and give you a little background on the project as well. So the basic questions that we're interested in understanding are how social learning works, and in particular, why, as a social scientist, am I interested in networks? It's really because they shape people's behavior. What information people have access to and how they behave depends on social context, and so we have to take the social context into account. That's the driving force behind understanding networks. Here what we're trying to understand is word-of-mouth learning. Even in a world where we have all kinds of social media and so forth, a lot of the information that we get is through other individuals, so it's through the connections that we have. Whether you buy a new product depends on whether people have told you about it or told you that you should buy it, et cetera. So there are questions about what we're aware of, and we need to understand the network structure in order to get at that. So although the internet amplifies information flows, word of mouth is still of fundamental importance, and if you want to really diffuse something, it's pretty clear that you have to seed things in the right places in the network, and that's really the topic of what we're looking at. So which measures capture influence and information propagation? What's the right way of measuring that, if we want to understand diffusion?
And what we're going to do in this paper is sort of follow up on some of the earlier work we did, which is to develop a family of measures which can be used to characterize people's position in a network with respect to information diffusion. So who's an important person in a network for diffusing information? We're going to test that empirically, and then we're also going to ask a question of whether or not people in a network are aware of other people's positions. So am I able to name the people who are really central in my network, and how might I know that? So the question here is, you know, are we aware of people who are influential beyond my own neighborhood, beyond the people I'm directly connected to? And if so, how does that work? And that's where the role of gossip is going to come in. So we'll talk about that. OK, so please feel free to hop in and ask me questions at any point in time, because I'm going to sort of flip back and forth between data and theory. So there'll be some theorems, there'll be some data, and I'll be mixing them back and forth. And anytime anything isn't clear, please let me know. So the background, we started working on this in, I guess, 2006, 2007. And we were working in rural Karnataka. So this is southern India, roughly within a 200 kilometer band of Bangalore. And what we did is we went into a series of villages, 75 different villages in Karnataka, each with a population of about 1,000 people. We went in and mapped out social networks in those villages. And we did that before a bank went in and started offering a microfinance program. And the reason we were interested in doing that and the reason the bank was willing to partner with us is that they were having extreme variation in the success rates they were getting across villages. So they were going to two villages that looked pretty identical to them in terms of population and profession and income levels and so forth. 
And in one village, they would get almost no participation. In another, they'd get fairly high participation. So actually, in the data we have, the variation is from about 7% of eligible households taking it up to about 44%. So there's a fairly large variation. And for them to break even, they need about 20% participation. So if they go into a village and they get less than 20% participation, it really doesn't make sense for them to go into those villages. There's enough fixed costs and so forth that they need some economies of scale. So what we want to do is understand why they're getting this large variation across villages. And they're really relying in these villages on word of mouth for getting information out. So the information propagation happens by word of mouth. And effectively, these are villages where the per capita income is a little less than 50 rupees a day, so it's about a dollar a day. So they're fairly poor villages. The median person is fairly uneducated. They don't have access to internet and so forth. So there's a lot of word of mouth which is necessary in order to get news out. So what they did is they went into these villages. They would tell a few individuals and then ask those people to spread the news in the villages. And we mapped out the networks and then tracked who participated in these programs over time. So we can see the diffusion happening and then estimate models and try to understand exactly how that was working. Okay. So in terms of outline, what I'm going to do is talk a little bit about the empirics. Then I'm going to define centrality based on a particular process, and I'll show you some theorems on that. And then we'll come back, and I'll talk about how people are aware of the centrality of other individuals in the network. Okay. So let's start with the empirics and how this information spreads. So this is the paper.
So the information diffusion stuff is a paper we published last year. And so there's the 75 villages. This bank, BSS, entered 43 of them and offered microfinance. Eventually they entered a bunch more past that. But we're going to work with the 43 villages that they had entered up through about 2011. We surveyed the villages before entry, observed network structure, and then we tracked the microfinance. Okay. So, if you don't know where Karnataka is, it's this area here. Most of the villages were fairly close to Bangalore. So most of the villages are in this part right here. They're relatively isolated. There's sericulture, so silkworm production, finger millet, some pineapple production, rice. You know, there's a variation in different kinds of production. Fairly poor. I can show you some demographics. So this is a picture of one of the villages. These little dots are individuals. They're clumped into households. And so, for instance, this is a borrowing network. What we did is we'd ask them a question: who would you go to to borrow 50 rupees for a day? And that paints out one of the networks. So here, if there's an arrow between one individual and another individual, that means that that individual said that if they had to borrow 50 rupees, they would go to that other individual. So we have that kind of information. And what we're going to do is aggregate things. So we're going to treat nodes as households, because you're allowed one loan per household. And we're going to treat those as the decision-making units and look at communication between households. And we asked a whole series of different questions. So this is, who would you borrow 50 rupees from for a day? This is, you know, who do you go to temple with? Who do you ask for advice on important decisions and important matters? So there's a series of different questions.
And you can see that the networks vary substantially in terms of the density and actually in some of the patterns of connections. What we're going to do for the purposes of today's talk is aggregate the networks. So we'll treat two households as connected if anybody answered yes on any particular question. So we'll just aggregate up. If you said yes to any question about another household, then there's a link between those two households, because we're interested in information transmission. And so if they have any contact, we'll treat that as contact. Now, what that does miss is, you know, if people say yes on all 12 questions, then they're probably talking a lot more than if somebody says yes on one question. So we're not going to do a weighted matrix. We're just going to treat things as zeros and ones, either connected or not connected. Okay, but later on I can tell you how some of these networks are much better at making predictions than other networks. And I can tell you a little bit about that if you're interested. Right, right. Yeah, exactly. So we're going to treat it as undirected, and we're going to treat it as undirected because all we're interested in is whether they can have a conversation where the information can pass. But you're right. Some of these are unidirectional, and it's actually fairly difficult to tease out whether or not things are just one way. So for instance, there's a lot of noise in survey methods. And so we ask questions about your relatives. So we ask people to name relatives outside the household. So for instance, who are your cousins that aren't in the household? And there you get about a 64% reciprocation rate. So, you know, I'll name somebody as a cousin and they won't name me back.
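The aggregation step described here (a link if anyone said yes on any question, then ignoring direction) and the reciprocation rate can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the two toy survey layers are hypothetical and are not taken from the project's data or code.

```python
import numpy as np

def aggregate_layers(layers):
    """Union of 0/1 survey layers, symmetrized into one undirected 0/1 matrix."""
    union = np.zeros_like(layers[0], dtype=bool)
    for layer in layers:
        union |= layer.astype(bool)      # a "yes" on any question creates a link
    undirected = union | union.T         # ignore who named whom
    np.fill_diagonal(undirected, False)  # no self-links
    return undirected.astype(int)

def reciprocation_rate(layer):
    """Share of nominations i -> j that are matched by a nomination j -> i."""
    named = layer.astype(bool)
    total = named.sum()
    return float((named & named.T).sum() / total) if total else float("nan")

# Hypothetical toy data: 4 households, 2 questions (row i names column j).
borrow = np.array([[0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 0],
                   [0, 0, 0, 0]])
temple = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 0, 0]])

g = aggregate_layers([borrow, temple])
temple_rate = reciprocation_rate(temple)
```

In the toy data, two of the three temple nominations are returned, so the rate is two thirds. The aggregated matrix `g` is unweighted, so it deliberately drops the information about how many questions a pair said yes to.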
So, you know, there we know that it has to be a reciprocal relationship, and yet, you know, sometimes you get directed arrows out of that. So there's a lot of noise in the data. And so part of the reason that we don't pay attention to direction is we're not sure that we can trust the direction. And actually the reciprocation rate is pretty similar across the other networks. So the advice network has about the same reciprocation rate as the cousin network. So it's not clear that there's a big difference in terms of whether we should pay attention to it. But it is something that if you had better data you could look at. And here I'm just not sure I trust the data enough to look at that level. So one of the main networks that actually works really well is the kerosene and rice borrowing network. So who do you go to to borrow kerosene or rice? Who comes to you to borrow kerosene and rice? We also have who do you go to in an emergency for medical help and so forth. So there's quite a few questions. You know, this is a different coloring of the network. These, again, are the households. Here I colored them by caste. So the red and blue mark whether or not they're scheduled castes and scheduled tribes, which are the ones that are recognized for affirmative action. And there you see a fairly strong homophily, a fairly strong cut in the graph. So there are going to be patterns in the graph that are going to be quite strong. Here, you know, the probability of a link within a group is about 9% and the probability of a link across these two groups is about six-tenths of a percent. So you see pretty strong schisms in the networks, and if you cut this by Hindu and Muslim in some of the villages it's even stronger. So there's some very strong patterns in the villages and that'll make a difference in some of what I say as we go along. Okay.
Data also include, okay, so we have a number of households composition, demographics, wealth variables, self-help group participation, ration cards, voting. So we have a lot of demographics about the individuals in the villages as well. And I can tell you at the end or if you're interested in what I'm going to do is focus in on the network stuff but I can also tell you how a lot of these other things matter. So a village typically is about a thousand people and roughly, I think, you know, the average number of households is 196. So I say 196 households. The number of connections that they tend to have, so you know, these graphs look pretty dense. It's usually about 15 other households that you're connected to. So 15 out of 196 households would be the typical number of connections that a household have. And there's actually a lot of overlap in that. So they tend to be, you know, fairly clique-ish and clustered within that. So there's a lot of closure in the network. Okay, so in information passing, you know, do initial injection points matter and then how should we measure that and what kinds of things can we say about measuring things? So first, just in terms of centrality measures, I'm sure some of you know this quite well but others don't, so I'll go through this just some basic ideas. So, you know, one way to measure centrality is just to count the number of connections that people have, right? So their degree, the node's degree is the number of connections that it has. So, you know, here this is the most central node, seven, this one, six. So having more connections makes you a better node for disseminating information. Maybe it's more central. So that would be one hypothesis. The problem with that kind of measure is it misses, it doesn't really capture position in a network that well. 
So, for instance, here, you know, these two nodes both have degree two and yet there's very simple ways in which we can see that they're not equally central or they wouldn't be the same in terms of trying to disseminate information. So this one is better connected. It's connected to a seven and a six. This is connected to two twos. It's not reaching the network. It's not reaching as many nodes as you go outwards. And so we want measures that are better at picking up information diffusion. So, for instance, if the bank went in and told this individual about the loan program and started spreading information from that person, it's more likely to have success than if they started from this person, for instance. That might be a hypothesis that we would have. So we want a measure that picks that up. And so the standard measure that's used in this is what's known as eigenvector centrality. So the idea here is you say instead of your centrality being proportional to how many friends you have, you say that centrality is proportional to the sum of the centrality of your friends. So if I have more central friends, I don't just sum up the number of friends. I sum up how important they are. And so now I've got a measure. The difficulty with this is now it's a self-referential system. So I have to find a fixed point of this. But luckily this is a simple calculation. We're saying that the centrality of a given individual is equal to the sum here. If we let this be the matrix of connections in the network, if i and j are connected, there's a 1, 0 if they're not. Then I'm just summing up over j, the centrality of the j's. This is just an eigenvector calculation of this matrix. So what we've done is now we've got a measure which says that the centrality, the vector of centralities, is just an eigenvector of the adjacency matrix. And in particular, we want one that has all non-negative entries. 
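To make the fixed-point idea concrete, here is a minimal sketch that computes eigenvector centrality as the leading eigenvector of an undirected 0/1 adjacency matrix. The toy network is hypothetical (it is not the network on the slide): nodes 4 and 5 both have degree two, but node 4 sits next to a hub. Taking absolute values to fix the sign assumes the network is connected, in which case the leading eigenvector can be chosen with all entries positive.

```python
import numpy as np

def eigenvector_centrality(g):
    """Leading eigenvector of a symmetric adjacency matrix, scaled to sum to 1."""
    eigenvalues, eigenvectors = np.linalg.eigh(g.astype(float))
    leading = eigenvectors[:, np.argmax(eigenvalues)]
    leading = np.abs(leading)  # sign fix; assumes a connected network
    return leading / leading.sum()

# Hypothetical toy network: hub 0 with leaves 1, 2, 3, plus a tail 0-4-5-6.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (5, 6)]
g = np.zeros((7, 7), dtype=int)
for a, b in edges:
    g[a, b] = g[b, a] = 1

degree = g.sum(axis=1)
centrality = eigenvector_centrality(g)
```

Here `degree[4]` and `degree[5]` are both 2, while `centrality[4]` is larger than `centrality[5]`, which is the distinction between the two degree-two nodes that the talk is pointing at.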
And the Perron-Frobenius theorem will tell us that there's a nicely defined solution to this, and there'll be a unique eigenvector, which will be the one that we're interested in, that'll have all non-negative entries if this is a non-negative matrix. So there's a well-defined measure of centrality based on "I'm as important as the sum of my friends' importance." And when we look at that, then we end up with things that are related actually to PageRank. So if you go back to the way Google defined PageRank, that was based on an eigenvector sort of calculation and an iterative solution to that. If you look at Markov chains and look at steady states in Markov chains, if you bounced around a network randomly following links at each point in time and counted the amount of time that you would spend at any node, that would be proportional to this measure as well. So it's a well-established measure in terms of the way people have looked at it in the literature. And when you look now at these two nodes, then from the numbers, this one is roughly three times as central as this individual. So it's picking up the idea that this one is better connected, because it's connected to this 0.39 and the 0.5, and this one is connected to 0.16 and 0.17. And so it's picking up this idea that there's some differences in there. Okay, so back to the data. So what the bank had done was go into each village. They talked to a few people in each village, usually between five and 15 people, usually around eight, nine or 10 individuals in a village. What they would look for is school teachers, self-help group leaders, and shopkeepers. So the bank employees were told to go into the village and look for specific kinds of individuals that they thought would be good people to spread information out.
And so what we're going to take advantage of in terms of the empirical work is the fact that in some villages the teacher happens to be very central and in other villages the teacher happens to be someone who is more peripheral. So they were looking for particular labels, but that ends up with variation in terms of the network position, and then we'll be able to use that variation in the network position to see whether, if they happened to hit a teacher who is very central, that worked better than if they hit a teacher who was more peripheral. Okay, so that's the identification strategy here. Okay, getting used to this. Okay, so two hypotheses just to start with. One, higher degree centrality is going to give higher diffusion. And the second is higher eigenvector centrality will give higher diffusion. So if we wanted to understand the diffusion, if we look at the villages where they hit people who had higher degree centrality, did that help spread information? Does that give us better diffusion and take-up of microfinance, or is it higher eigenvector centrality? Yes, exactly. So if you look at a non-negative matrix and you look for an eigenvector, there's going to be as many eigenvectors as entries generally. But if you're looking for one with all non-negative entries, then that's going to be unique. And it's the one that's going to be associated with the largest eigenvalue in absolute terms. So if you look at the largest eigenvalue in magnitude, there's conditions under which it's going to have all positive entries, but generally it's going to have all non-negative entries. And if the network is acyclic and, sorry, aperiodic and strongly connected, then you're going to end up with a unique all-positive-entry one. But generally it's going to be non-negative and there's a well-identified eigenvector to work with. I was just going to say conversely, couldn't it be considered that if one's neighbors know a lot of people? Yes, yes.
So there's a lot of models you could think of in terms of what the right notion of influence is. And we're going to come up with one in just a moment. So I'll go into a specific model which will actually tie into eigenvector centrality. So right now what we've done is just picked two of the most prominent measures from the literature. And what you're saying is, well, maybe we want to do something else. Maybe we want to actually figure out exactly how people are communicating and model that and think about that. And maybe having more degree means that people don't pay attention to you, and so you're not going to be as good at spreading information. Yeah, yeah, yeah. So there's going to be a lot of different hypotheses that we can come up with and test. And I'll go through another one in a moment. And so what we're doing right now is just pulling two centrality measures off the shelf. We'll see that one does not do so well. The other does better. And then we'll produce one that does much better than either of these. And then you can come up with almost an infinite number of hypotheses as to what the right model would be. And so we've run a few more models. And I can tell you about them. But I'll show you another one in a moment. Okay. So first of all, if we look at the degree, it really does nothing. So what's a point here? A point is a village. This number down here is the average degree of the leaders, the self-help group leaders, the teachers, and the shopkeepers that they first went to. So we just count what was the average degree. And then here is the microfinance participation that they eventually get. So down here is the 7%. Up at the top is the 44%, the village where they had the most take-up. And you see that there's really no relationship. If anything, you get a slightly negative fit, but it's insignificant. So there's no real correlation between the degree of the points that they started with and the eventual take-up.
If you go to look at the eigenvector centrality, then you get a significant positive relationship between the eigenvector centrality of the initial injection points and the eventual participation. I can show you some regressions on that. So if you regress the percentage of people who ended up taking up microfinance on different centrality measures, then eigenvector centrality gives you a positive significant relationship. Degree gives you nothing. People always ask about their other favorite centrality measures. There's a bunch of different measures from the literature. So closeness: you sum up inverse distances between different individuals. So how close am I to all the people in the village? Betweenness: do I sit on a lot of paths between other individuals? Bonacich centrality, I won't go into the definition of that, but it's somewhat related to eigenvector. But these other ones don't do so well. In terms of the R squared, the amount that you're predicting, eigenvector is doing a little better than the other ones, but not a whole lot. So you're explaining about a third of the variation overall, but it's not doing much better than the other explanatory variables, and a lot of the explanation is coming from numbers of households, self-help groups, characteristics of the villages in terms of other demographic variables. Okay. So now let's talk about defining centrality by a particular process. So what I'm going to do is define what we call diffusion centrality. So what does diffusion centrality do? It just thinks carefully about a model of diffusion. And what we're going to do is say, let's suppose we start with a particular node and we seed that node with information and then we ask them to pass information along. Now let's suppose that each household interacts randomly with some probability p in a given period and interacts for some number of periods.
So we'll keep track of two different variables. One is, what's the probability I tell my neighbors about something I've heard about, and then how many times do I go through this process of broadcasting information? Okay. So node i is initially informed. Each informed node tells each of its neighbors with some probability p in each period, and then we run this for t periods. And we'll call that diffusion centrality. Okay. So here what we're going to do is we start with some particular node. Let's say that we tried this with p equals 0.5 and t equals 4. If this were the first node we introduced, then we can just run a simulation. It's going to randomly flip coins to tell any neighbor in any given period. So we run the first period. What happens? Well, it tells one of its neighbors. Now this neighbor knows. One period is run. Now in the second period, this person can tell people. Again, this person can randomly tell people. So we do that again. It spreads further. You keep doing this, three periods, four periods. And then we count up how many nodes we get. So this person would have a diffusion centrality of 13 from that simulation. Okay. If you picked a different node, you could run a simulation on that. This person would end up telling six. So one possibility would be just to keep simulating it. And it would say this node is more central than that node by a factor of roughly two under this diffusion centrality measure. So now what we've done is tried to model the diffusion process directly to define the centrality. The way we're actually going to define this, for various reasons, is instead of running simulations on this, we're just going to define it by hitting the matrix g with a probability p. So this just represents the probability that I interact with each neighbor at each point in time. Then we're going to raise it to various powers and then sum that up over the t periods. Okay.
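The coin-flipping simulation just described can be sketched as follows. This is a rough illustration, not the code behind the slides: the function names, the toy path network, and the choice to count informed nodes other than the seed are all assumptions made here. A node that first hears the news in a given period starts telling others only in the next period.

```python
import numpy as np

def simulate_reach(g, seed_node, p, periods, rng):
    """One run: how many nodes other than the seed end up informed."""
    adjacency = g.astype(bool)
    n = adjacency.shape[0]
    informed = np.zeros(n, dtype=bool)
    informed[seed_node] = True
    for _ in range(periods):
        coin_flips = rng.random((n, n)) < p  # one coin per ordered pair
        told = coin_flips & adjacency & informed[:, None]
        informed = informed | told.any(axis=0)
    return int(informed.sum()) - 1

def average_reach(g, seed_node, p, periods, runs=2000, seed=0):
    """Average of simulate_reach over many independent runs."""
    rng = np.random.default_rng(seed)
    counts = [simulate_reach(g, seed_node, p, periods, rng) for _ in range(runs)]
    return float(np.mean(counts))

# Hypothetical toy network: a path 0-1-2-3-4.
path = np.zeros((5, 5), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    path[a, b] = path[b, a] = 1

middle = average_reach(path, seed_node=2, p=0.5, periods=4)
end = average_reach(path, seed_node=0, p=0.5, periods=4)
```

With p = 0.5 and four periods, seeding the middle of the path reaches more nodes on average than seeding an end, which is the kind of comparison between starting points that the simulation is meant to give.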
Now, what's the difference between this and the simulation that I just talked about? This will do some double counting. So this will allow a given node to be told four times or something. So I'm counting how many times, if I was broadcasting information, it would hit each node. So how many times does a given node hear something that would have come from the initial node? So the ijth entry of this is then how many times j would have heard some piece of news that i started with if this was going on for t periods. Okay. So it's not just counting whether you're informed or not, but how many times I would have heard information through this process. Okay. Is that clear? Yeah. Right. So what this measure is doing is, instead of working with eigenvector centrality, we're trying to define the centrality of each node directly, and we can do this calculation for every node in the network. Right. So this matrix now, what we're going to end up with, I'll mark this over here. Right. So if we look at the ijth entry, if this is, you know, two, it would say that j has heard two times information that came from i, and so forth, right? So i would get some amount of information to node one, you know, I don't know, 2.6, and so forth. So we've got a whole series of entries here. These various entries are telling how many times does node one hear something that i told, how many times does node two hear something that i told, and so forth. And then when we sum this all up, we get this number, say 13, which says node i manages to get this many pieces of information out to other people in the village. We can do that for every, you know, i, k, et cetera. So each person ends up with some number, and then we can do comparisons of this.
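The matrix version can be written down directly: scale the adjacency matrix g by p, add up its first t powers, and take row sums. The sketch below follows that verbal description; the function names and the three-node example are illustrative assumptions, not the project's code.

```python
import numpy as np

def hearing_matrix(g, p, periods):
    """H = sum over t = 1..periods of (p * g)^t.

    Entry (i, j) is the expected number of times j hears news that
    started at i, with repeated hearings counted each time.
    """
    pg = p * g.astype(float)
    power = np.eye(g.shape[0])
    total = np.zeros_like(pg)
    for _ in range(periods):
        power = power @ pg
        total += power
    return total

def diffusion_centrality(g, p, periods):
    """Row sums of the hearing matrix: one number per starting node."""
    return hearing_matrix(g, p, periods).sum(axis=1)

# Hypothetical toy network: a path 0-1-2, with p = 0.5 and two periods.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
H = hearing_matrix(path, p=0.5, periods=2)
dc = diffusion_centrality(path, p=0.5, periods=2)
```

For this toy path the row sums come out to 1.0, 1.5 and 1.0. The diagonal of `H` is not zero, because news can travel out and come back to its originator, which is part of the double counting mentioned in the talk.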
And these comparisons will tell us, you know, who's the best person for broadcasting information, effectively, by just running a model. And so we're not using, so, you know, there's also eigenvector centralities for each one of these individuals, and instead there's a number here, and these numbers are going to be slightly different from the eigenvector centrality, and we'll use those to do prediction. Is that clear? Okay, so what's, so now back to the new paper, so I'm flipping back and forth between the empirics and stuff, so I'll tell you some results about this measure. So what's the relationship between diffusion centrality and these other measures? So one thing that's pretty clear, if we ran the diffusion process just one period, then the number of people I'm going to tell is just going to be proportional to how many neighbors I have, right? It's just going to be p times, if I have six neighbors, the expected number of people that are going to hear anything is just p times six, so it's just going to be proportional to six. If you have 12 neighbors, it's going to be p times 12. So what's happening is I'm just broadcasting information, if I just do it once, it's degree centrality, right? If communication happens infinitely many times, then this diffusion centrality is going to converge to eigenvector centrality. So in particular, the first theorem we have, if we look at this diffusion centrality, if t equals one, then diffusion centrality is proportional to degree centrality, so it's exactly the same as degree centrality. If I'm just randomly talking to my neighbors at any given period with some probability, more neighbors, more dissemination. If p is bigger than one over the first eigenvalue of the matrix, and this is really the point at which if p is bigger than this, then what happens is you expect the news to saturate in terms of eventually everybody will hear if I run this in an infinite number of periods. 
As t goes to infinity, diffusion centrality converges to eigenvector centrality. So this is the main mathematical part of the content of this theorem. So what we do is we prove that the diffusion centrality converges to eigenvector centrality if you run this an infinite number of times. But if you're varying this t, then what's happening is you're interpolating between just telling your immediate neighbors, which is just going to be degree centrality on the one hand, and then at the other extreme, you get eigenvector centrality. So this family of measures is spanning between degree and eigenvector centrality. If p is less than one over the first eigenvalue, then this sum actually converges. And what it converges to is Katz-Bonacich centrality. So depending on whether this is divergent or convergent, you either get eigenvector centrality or Katz-Bonacich centrality. So if p is really low, then I end up not actually getting much information out. Things die off. And that's what's known as Katz-Bonacich centrality. I won't go into that in too much detail. So this is the main content of the theorem. The theorem's proof is fairly routine in terms of the matrix algebra. What we do is we approximate the matrix by a diagonalizable matrix. Then we diagonalize it and then, using the spectrum of the matrix, we prove that this thing converges to the first eigenvector. And then we use a continuity argument to go back. So the mathematics of it aren't very deep. It's fairly easy in terms of the matrix algebra. But it's showing that this thing is spanning between eigenvector centrality on one hand and degree centrality on the other hand. Any questions on that? OK. So now we had hypothesis one, degree centrality, and two, eigenvector. Now let's see if diffusion centrality does better than the others. So two things. One is now we've got a measure that requires a P and a T. So it's no longer a measure which is independent of parameters.
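The three cases in the theorem can be checked numerically on a small example. This is only a sanity-check sketch under assumptions: the toy network (a triangle with a two-node tail, chosen to be connected and aperiodic), the values of p, and the function names are all made up for illustration, and "converges to eigenvector centrality" is read here as convergence after rescaling the vector to sum to one.

```python
import numpy as np

def diffusion_centrality(g, p, periods):
    """Row sums of sum over t = 1..periods of (p * g)^t."""
    pg = p * g.astype(float)
    power = np.eye(g.shape[0])
    total = np.zeros(g.shape[0])
    for _ in range(periods):
        power = power @ pg
        total += power.sum(axis=1)
    return total

def leading_eigenpair(g):
    """Largest eigenvalue and its eigenvector, scaled to sum to 1."""
    values, vectors = np.linalg.eigh(g.astype(float))
    k = np.argmax(values)
    v = np.abs(vectors[:, k])  # sign fix; assumes a connected network
    return values[k], v / v.sum()

# Hypothetical toy network: triangle 0-1-2 with a tail 2-3-4.
g = np.zeros((5, 5), dtype=int)
for a, b in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    g[a, b] = g[b, a] = 1

lam1, eig = leading_eigenpair(g)
degree = g.sum(axis=1)

# One period: proportional to degree.
dc_one = diffusion_centrality(g, p=0.6, periods=1)

# Many periods with p above 1 / lam1: shares approach eigenvector centrality.
dc_long = diffusion_centrality(g, p=0.6, periods=100)
dc_long_share = dc_long / dc_long.sum()

# p below 1 / lam1: the sum converges to the closed form (I - p g)^{-1} - I.
p_small = 0.2
identity = np.eye(5)
katz = (np.linalg.inv(identity - p_small * g) - identity).sum(axis=1)
dc_small = diffusion_centrality(g, p_small, periods=200)
```

On this example `dc_one` equals `0.6 * degree`, `dc_long_share` matches `eig` to many decimal places, and `dc_small` matches `katz`. The aperiodic choice matters: on a bipartite network such as a simple path, the largest and smallest eigenvalues have the same magnitude and the rescaled vector need not settle down.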
Now we have to say how many periods does this run for and what's the probability that people talk to each other? So what we did in terms of estimating diffusion centrality is, for T, we made it proportional to the amount of time that different villages were exposed to the microfinance program. So some villages were exposed for two years, some for just one year, and so forth. So we scaled the number of time periods by the amount of time, and in particular we did it by quarters, well, actually trimesters. So we used the number of trimesters that a given village was exposed. In order to get the P then, what we did is we actually estimated the P from the data. So for the P, we picked the value for which the diffusion model best matches the eventual participation rates. And so the estimated P is about 0.2. So it looks like people talk to their neighbors with about 20% probability in any given period. Exactly. Yeah, yeah, yeah. Right. Is a village that had P equals 1? Well, so what we do is each village ends up with a number. So each village ends up with a time. And so we have a village, village 1. Maybe their participation rate was, say, 18%. Their T was equal to 4. Maybe village 2 had a participation rate of 33. And T was equal to 6. Village 3 had a participation rate of 14. And T was equal to 3. And so forth. So each village has this. The point is, for the first four periods, part of the 33 came during the first four periods. Right, right, right. Sure, sure, sure. Exactly. And in fact, we could get better fits by using the time series. There's a whole appendix to the original paper where we go through fitting time series. And you get slightly better fits if you do the time series. The time series is more cumbersome. So we're going to work just by fitting the end numbers. But you're right. You could take advantage of the time series and it gives you a more accurate P. Okay.
So what do you see? Now you see, again, a statistically positive relationship given that this nests the other two measures you should expect this. In terms of the significance, it's highly significant now at the 99% level. The most important thing is the r squared has gone up fairly substantially. So now you're explaining almost half the variation in the data. So it's doing a lot more explanation. One caveat, of course, is what we've done is we've expanded the, you know, so now we've got an extra dimension of freedom in terms of fitting the P to the data. And so we're fitting part of this. And it's not, given it nests these other measures, it's not surprising that we do better. But it is picking up something which is going on in the data and is a much better predictor of what's actually happening in the eventual take-up. So this diffusion centrality does better than, you know, sort of pulling off the shelf centrality measures in terms of matching the actual variation. We actually did both. So we have another, so if you fit the T as well, so you can fit both, we wanted to sort of tie your hands not to over-fitting and allowing too many parameters. But at least, I mean, you could have this, this, the discipline, the four, six, and three, but then have a constant, because you've got a choice here of non-period. Right, so two things. One is if you just allow rescaling of the T's but in proportion to these, then that makes very little difference. So, you know, doubling this, basically what happens is the P scales down, and so you don't get a whole, you know, you get a whole ridge where you're going to get almost equal variation on it. If you do a fit, sort of just, you know, best fit, fitting both P and T, you get a slightly better, I think this goes up to, like, .454 or something, but it's not a huge improvement. And so we just wanted to do it in a way that was, we weren't freeing up too many parameters. 
But you could free up, you know, you could free up another parameter, and it increases the fit and does a little better than what we're doing here. Yeah, so actually, yeah, yeah, yeah, yeah. No, right, right, so the actual T's, in fact, are very close to this. So if you do the actual, yeah, yeah, yeah, so it would be between two and five, I think, so the range of optimal T's. And what you can do is actually fit, right, fit different T's per village, too, and they come out, you know, two to five, basically, in that range, yes. Okay. So let me talk now about the gossip part. So how do people actually know? So, you know, what we did here is we've got this diffusion centrality measure so we can say, okay, look, if you wanted to go into a village, what would be good? Go in and find the most diffusion central people in a village, tell them about microfinance, and that would spread things out nicely. Well, what's the problem with that? The problem with that is in order to do that, you have to go in, first of all, map out the villages, then, you know, go through, figure out what the diffusion centrality is and tell them, you know, who's the most diffusion central individuals. The difficulty is, of course, the individuals who are, if you went in to map out the villages, you're already talking to all the people in the village, you might as well just tell them about microfinance, right, so it would be a very expensive way to go through this process. So, you know, we're interested in sort of identifying these central individuals, but without, you know, are there some other ways of identifying it? And so one way we thought about was, well, you know, let's just ask the people in the village. So if we go into the village and just ask them, you know, literally sort of take us to your leader, who would they take us to? And we wanted to pose the question, so actually we've done follow-ups on this. 
I don't have the second wave of data, but we asked them and sort of, you know, who are the most important leaders in your village, but also if we want to get information out about a new product, who should we talk to in your village, right? So we can ask them a very directed question and see how they answer. So what we did is we asked each adult in the village who's the best starting point for diffusion. We're, again, working with the households as units. Here we're going to work with the subset of 33 villages, again, about 196 households per village. So we worked with a subset of the villages that we could survey more recently. So if you want to, these are the questions. If we want to spread information about a new loan product, who do we speak to? If we want to spread information about tickets to a new event. So we asked two different questions just to, you know, one about a loan product, but we also were worried, maybe they thought loans were special and they might act differently about that. So we also asked them about a, you know, spreading information about an event or a fair to refer and see if they asked differently, answered differently. So what time do we have until the... Okay, okay, so we got plenty of time. Okay. So interestingly, and I can give you our hypothesis as to why, a lot of people didn't want to answer. So a lot of people refused to answer the question. So we got about half the households were willing to name somebody for loan diffusion about 40% for event. And here I think the, you know, they're afraid that if we asked them to name one household, that they might insult somebody by naming one house. If somehow the household found out that they named them and not somebody else, maybe they would insult people. So there was a substantial number of people who did not name anybody. They just refused to answer the question. 
When you actually look then at the number of households that were named, it's a small proportion of the total number of households. So it's about 5% of the total number of households who are actually named for the loan and 4% for the event. So basically, you know, things looked very similar for event and loan. Conditionally upon being named, you're named nine times on average. So that means if somebody named me, then actually I was likely to be named by lots of different households. So there was a lot of consensus in terms of who people named. Once they named a household, that household was likely to be named by lots of other households. So at least they're agreeing on who the more, the best diffusion households are. So they're focusing on a few people. Then we can ask them, you know, are they naming highly central people? Are they naming these diffusion-central individuals in the village? So here's what they're... So first of all, a little bit of background. So we can ask first, are they actually naming people who are beyond their immediate circle? So maybe they just name somebody out of their immediate friends. So this is the shares, this is for the loan question, this is for the event question, and this is how many people they have in their immediate neighborhood overall, right? So they tend to be connected to about 10% of the population and they're naming people inside their neighborhood about 20% of the time. So they're over-representing their immediate friends and they're under-representing people that are a distance three or more away from them. And they're hitting their second neighborhood with about the right frequency. So, you know, about 50% of the individuals they reach in about in two steps, the average distance between households in these villages is about 2.7, so they're relatively small worlds. But there is a little bit of a bias here. So they're naming, they tend to name people that they know better than people that they know less well. 
But if you look at the average, so let's do eigenvector centrality, we can do, it looks very similar for diffusion centrality. If you look at, say, the average eigenvector centrality, the people that they name, it's pretty similar across one neighborhood, two neighborhood, and three neighborhood. So they're naming people who tend to be in the 75th percentile on average of the eigenvector centrality measure. So here what they're doing is naming people, but they are naming people who are fairly central in the villages. Okay, and they're doing it fairly consistently when they're naming individuals, whether it's the first neighborhood or the third neighborhood. So if they do name one of their friends, they pick out somebody who's fairly central. Okay. If you look at the, when we're looking at, say, the quintile of eigenvector centrality, when we're looking at the number of people that are nominated for, say, the loan question, the vast majority of the people who are nominated are going to be in the top 80th percentile or above of eigenvector centrality, some in the 60-80 range, and then some who end up being less central. Okay, so they are picking off people that are skewed towards the higher end of the distribution. And then you can regress, say, you know, how many nominations does the household get? So how many times am I nominated by somebody else for being central as a loan diffuser, or, yeah, just stick with loan in this graph? You can look at, say, my degree centrality, doesn't seem to matter. Diffusion centrality, here we're using 0.2 and 3, which 3 is the best fit, so now we're going to the best fit T. So if we do that, then, you know, you get a significant relationship. Eigenvector centrality, you can run that with diffusion centrality. Diffusion centrality seems to pick off what is statistically significant, whether or not they're actually a leader by the original definition of self-help group leaders, teachers, shopkeepers. 
That's also predictive of whether you're nominated. So people do tend to nominate individuals who have a high diffusion centrality. Also, having status is a reasonable predictor. We also ran some GPS numbers to try and see if they were just naming people who were central in the village in the GPS sense. And so you can do a GPS centrality measure, and that doesn't seem to matter at all. So they're not naming people who are just sort of, you know, physically central in their village. They're naming people who are socially central in their village. And this has village fixed effects. Okay. Any questions on that? So we're getting some correlation between diffusion centrality and the people that they're naming. And so, you know, one question is how. They're able to name people who are outside of their immediate friends, and they're naming people who are in the higher brackets of diffusion centrality, and they're naming people in a way that's, you know, correlated with the leadership, but also independent of that. So it's picking up some other things. How are they able to do that? And so let's go through a simple model of that just to understand this. So now here's where the gossip comes in. So let's think of a town gossiping. So what we're doing is now we're just starting a bunch of things. Arun moved to Stanford. Matt has a new MOOC. Esther won an award. So, you know, we just start spreading this kind of information out. And what's going to happen is there's a piece of information about a given individual that gets broadcast here, okay? And let's work with the same kind of diffusion model where there's some probability p that you talk to your neighbors and spread gossip. So, you know, I've heard that Arun moved to Stanford. I tell my friend that. So that gets moved, and then we're going to keep track of how many times people hear news about different individuals, okay?
So what we do is I'm sitting here, and I keep track of, oh, I've heard five pieces of news about Arun, seven about Esther, I've heard about myself a few times, and so forth. So I keep track of this thing, so I have a vector that I keep track of, for each person in the village, of how many times I've heard news that started at that node. And what we're going to presume is that news starts independently across all the nodes with the same frequency. So news about different individuals comes at the same rate. And that might be a bad assumption. It might be that people who are more influential and central have more news about them that might be spreading. That's actually going to accentuate this measure, but here we'll work with just the uniform case. So news comes out about all these individuals and randomly spreads through the village, and then as a listener, I just sit here and I keep track of this vector, okay? And the question is, is this vector a good predictor of how central the individuals are? So I'm just sitting here. I'm saying, okay, Esther looks pretty good. Whoever this person is, maybe it's Mark. Mark looks great. He's got an eight. So I name Mark as the most central person because I hear the most gossip about him. So if I'm going through this, then now I've got these different measures. Is that related to diffusion centrality or eigenvector centrality? Okay, so let's keep track. Network gossip, p, T again. The i, j entry is going to be how many times node j has heard news about i after T periods, if there's probability p of passing. So it's closely related to diffusion centrality, but what we've done is we've reversed it. Instead of asking, from a given node, how do I broadcast out, now we're asking, as a listener, how many times have I heard about other nodes, and is that going to be related to the measure of diffusion centrality? Okay, are there any questions on that? Idea, concept pretty clear.
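The listener's bookkeeping can be sketched as a matrix. This is a minimal illustration in Python with NumPy, assuming network gossip is the matrix sum over t = 1..T of (pA)^t, so that entry [i, j] is the expected number of times j hears news that started at i. The toy graph and the function name `network_gossip` are hypothetical.

```python
import numpy as np

def network_gossip(A, p, T):
    """Matrix sum over t = 1..T of (p*A)^t.

    Entry [i, j] is the expected number of times node j hears news
    that originated at node i within T periods.
    """
    n = A.shape[0]
    total = np.zeros((n, n))
    power = np.eye(n)
    for _ in range(T):
        power = p * (power @ A)   # now holds (p*A)^t
        total += power
    return total

# Toy graph (hypothetical): a triangle 0-1-2 with a tail 2-3-4.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

NG = network_gossip(A, p=0.2, T=3)

# Listener's view: column j is the vector node j keeps track of,
# one count per node in the village.
heard_by_node_4 = NG[:, 4]

# Broadcaster's view: summing row i over all listeners gives the
# diffusion centrality of node i for the same p and T.
diffusion = NG.sum(axis=1)
```

Reading the same matrix by rows or by columns is exactly the reversal described above: rows are how far a node's news travels, columns are what a listener hears.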
Okay, so what's the theorem? The theorem is that every individual's ranking of others under network gossip will agree with the ranking by diffusion centrality, for any p above one over lambda one, that is, bigger than the reciprocal of the first eigenvalue of the matrix, and for large enough T. And just to give you an idea, in these villages the reciprocal of the first eigenvalue turns out to be about 0.03 or 0.04, so it's a fairly low threshold: if you're spreading with a probability above that, you're good. Okay, so we're going to get something which looks like diffusion centrality if we wait long enough, if there are enough time periods, because it does take some time for news to arrive from individuals at a distance. So if we only ran this for three periods and there's somebody four away from me, I'm going to get a zero there. I'm never going to hear news about them. So it has to be that this is running for a long period of time, and given that gossip might be running over a period of years, the idea would be that over a long period of time I'm going to be hearing gossip about lots of individuals and that should be working. The proof of this theorem is different. Just because we're using a p and a T here, it's not simply a restatement of the first theorem. Proving that each node listening hears things at the same rate that the nodes broadcast actually involves a different calculation, but it shows that if you do the calculation differently you get the same answer. So it's again a diagonalization kind of argument, but it's doing it from a different angle. Okay, with many iterations, basically everybody picks up on eigenvector centrality, diffusion centrality. For small numbers of iterations, the average network gossip correlates positively. So what you can prove is in the limit it's going to work perfectly. Everybody's going to know exactly who the most eigenvector-central, diffusion-central individuals are.
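Both halves of that statement can be checked numerically on a small example. The sketch below, in Python with NumPy, assumes the same matrix definition of network gossip, the sum over t = 1..T of (pA)^t. The seven-node graph, p = 0.5, and T = 200 are hypothetical choices picked so that p exceeds 1/lambda_1 and one pair of nodes sits four steps apart.

```python
import numpy as np

def network_gossip(A, p, T):
    """Matrix sum over t = 1..T of (p*A)^t; [i, j] = times j hears about i."""
    n = A.shape[0]
    total = np.zeros((n, n))
    power = np.eye(n)
    for _ in range(T):
        power = p * (power @ A)
        total += power
    return total

# Hypothetical graph: triangle 0-1-2, tail 2-3-4-5, and node 6 hanging off node 1.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (1, 6)]
A = np.zeros((7, 7))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

lam1 = np.linalg.eigvalsh(A)[-1]   # first (largest) eigenvalue
p = 0.5                            # above 1 / lam1 for this graph

# Short horizon: node 0 is four steps from node 5, so after three
# periods node 0 has heard nothing that started at node 5.
short = network_gossip(A, p, T=3)
heard_too_far = short[5, 0]

# Long horizon: each listener's column ranks the nodes the same way
# as diffusion centrality (the row sums) does.
long_run = network_gossip(A, p, T=200)
dc_rank = np.argsort(-long_run.sum(axis=1))
listener_ranks = [np.argsort(-long_run[:, j]) for j in range(7)]
all_agree = all(np.array_equal(r, dc_rank) for r in listener_ranks)
```

With a short horizon the listener's vector has exact zeros for distant nodes, which is why the theorem needs T to be large.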
In the short term there's going to be a correlation and each individual will be biased but you can prove that the correlation of the average individual is going to be positive. So the covariance of the diffusion centrality measure and the network gossip is actually proportional to the variance of the diffusion centrality and you can show that that's a positive number. Yeah, so we did both of them for the undirected but the definitions work exactly for everything that we've done here. You can actually, the proofs and everything are pretty much just a simple change of notation. Yeah, yeah, yeah, yeah. Right, so what has to happen, so for the second theorem to hold in the directed case, you need some strong connection assumptions. So it could be, for instance, if I'm broadcasting out and there's some people that hear about me but I never hear back, then there's going to be issues about, certain individuals can never hear about other individuals. But if you have a strongly connected network so that there's a path, a directed path from every node to every other node, then you get the second theorem in the directed case. Any other questions on that? Yeah, yeah, they're, well, yes, yeah, yeah. The degree that they have and the size, yeah. There's actually some Hermit households. So we have some households that are fairly disconnected that really are sort of islands but they tend to be individuals who just moved into a village and don't really have any connections. Okay, so then, you know, if we sort of flip this, we can ask them, you know, people are fairly good at actually identifying people just by asking who the individuals are and that's information that's significant even beyond including their leadership status or GPS locations and other things. So they are seeming to do this. The theory tells us in the limit they should be able to do that just by listening to gossip. 
Whether or not that's the right explanation, we don't know, but the theory sort of says if people are just sitting there and listening to information and just naming people that they hear a lot about, then importance actually works well if gossip is the conduit. So, you know, one thing you can do is you can say, let's look at this combination. So here what we can do is, what's the probability that you're in the top 10% of the diffusion centrality individuals in the village? And let's look at whether you were named by somebody and a leader, whether you were named and not a leader, whether you were not named but a leader and whether you were not named and not a leader, okay? And so what the red does is keep track of what's the share in the population of this set of individuals. So, you know, the people that were not named and not a leader, that's almost 90% of the population. Then if you're not named and not a leader, you're pretty unlikely to be in the top 10% of the centrality. And basically once you're named and a leader, that's only a tiny percentage of the population, so it's a little more than 3% of the population, and you end up with almost 50% chance than being in the top 10% of the diffusion centrality. So here, you know, having, just going through and looking at the leadership status and asking whether this person was named by individuals, asking if, you know, some subset of the population, who are the people that we should talk to, you end up with a fairly good predictor of their diffusion centrality. And being nominated and a leader works well, and being nominated is better than being a leader and by a factor of about one and a half, and significantly better. So you're better off just paying attention to who they name than actually looking at the status of the individuals directly. Any questions on that? Okay. So then you can, you know, look at the diffusion centrality, how well is it predict by nominations and leaders? 
Again, you know, this is just a regression which sort of verifies what I put in the table you saw, and GPS doesn't seem to matter much. Okay. So just in terms of pictures to sort of see what it looks like, this is one of the villages again. So if you work with networks, you work with a lot of different visualization packages. So the first one was in R, the second one was in UCINET, this one is in Gephi. So here these are, again, households from one of the villages. And then what we can do is say, you know, these are the leaders in this village. So these are the dots of the people who are leaders. You can begin to see, again, the segregation patterns in the village. So there are different cliques of individuals who are not that well connected. Leaders sort of span through them. They're reasonably central. These are the people that were nominated in this village, so the people that were named as the ones we should go to for diffusion. You can begin to see, again, roughly half the people in the village are naming a household, and yet we were getting just seven households being named. So that means a lot of people are naming the same households. They tend to be fairly central individuals. And when you put together both, you know, the leadership status and being nominated, then you pick off people who, in this case, are all in the top 10% of the diffusion centrality. So they really hit individuals fairly well. One thing that's true, though, is you might end up, you know, skewed towards a subset of the population. So here, you know, you would worry that if we took this individual out, you might not get great diffusion here, because there's a lot of segregation. And so there are, you know, different people in different parts of the social network of the village, say, different caste levels or religions and so forth. And you might want to make sure that you get a good mixture of that in order to seed these.
One thing, sort of in general, there's an older paper by David Kempe, Éva Tardos, and Jon Kleinberg that looked at this: suppose you want to pick some subset of nodes that's going to maximize the chance of diffusing something. So here we've worked with this measure of diffusion centrality. You could think of posing a question of picking a set of nodes to do this. And if you wanted to pick a set of nodes, you don't want to just rank the people in terms of their diffusion centrality, because it might be that the first and second people are right next to each other. And so, you know, picking those two nodes might not be as good as picking this node and that top node, even if this person was third and that person was second. But figuring out the optimal configuration is an NP-hard problem. So it's not an easy problem to calculate. There's, you know, picking the optimal subset of nodes out of a big set. If I'm picking more than a few, it can be computationally quite intensive, because I have to look at all these different subsets and try and figure out which is the best subset. There's lots of subsets here. Okay. So summary, you know, diffusion centrality and eigenvector centrality of the injection points is a significant predictor of informativeness. Simple model of direct behavior can help define influence. And let me say sort of methodologically where I think this is important. You know, this is a very specific application. It happened that this measure works fairly well. It might work well in other contexts also. But I think that one sort of overarching message here is that there's a lot of sort of ad hoc measures that have come up over time of, say, influence. And, you know, we have degree centrality, eigenvector, betweenness, closeness. These are all things that have some sort of intuition behind them, but they're often defined in a somewhat ad hoc way.
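The point about adjacent top-ranked nodes can be seen in a small sketch. This is plain Python under loud assumptions: the objective is a simplified stand-in (how many households are a seed or a direct neighbor of a seed), not the diffusion model from the talk, and the nine-node graph and all function names are hypothetical. The greedy rule, adding whichever node contributes the most newly reached households, is a standard heuristic for this kind of subset problem rather than an exact solution.

```python
def diffusion_centrality(adj, p, T):
    """Sum over t = 1..T of (p*A)^t times the ones vector, on an adjacency dict."""
    total = {n: 0.0 for n in adj}
    walk = {n: 1.0 for n in adj}
    for _ in range(T):
        walk = {n: p * sum(walk[m] for m in adj[n]) for n in adj}
        for n in adj:
            total[n] += walk[n]
    return total

def reach(adj, node):
    """The node itself plus its direct neighbors."""
    return {node} | adj[node]

def coverage(adj, seeds):
    """All nodes that are a seed or adjacent to a seed."""
    reached = set()
    for s in seeds:
        reached |= reach(adj, s)
    return reached

def greedy_seeds(adj, k):
    """Pick k seeds one at a time, each maximizing newly reached nodes."""
    chosen, reached = [], set()
    for _ in range(k):
        best = max(
            (n for n in adj if n not in chosen),
            key=lambda n: len(reach(adj, n) - reached),
        )
        chosen.append(best)
        reached |= reach(adj, best)
    return chosen

# Hypothetical village: a dense cluster around node 0, a smaller
# cluster around node 5, and a single bridge 4-6 between them.
adj = {
    0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}, 4: {0, 6},
    5: {6, 7, 8}, 6: {4, 5}, 7: {5}, 8: {5},
}

dc = diffusion_centrality(adj, p=0.2, T=3)
top_two = sorted(adj, key=lambda n: -dc[n])[:2]   # highest-ranked pair
greedy = greedy_seeds(adj, 2)                      # spread-out pair

covered_by_top_two = len(coverage(adj, top_two))
covered_by_greedy = len(coverage(adj, greedy))
```

Here the two highest-ranked nodes are neighbors inside the same cluster, so seeding both reaches only that cluster, while the greedy pair places one seed in each cluster.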
And I think that we need more ideas of, you know, sort of modeling exactly what the process is and defining the measures that we need based on some process or else on properties. You know, sort of figuring out what are the properties of different measures. So doing axiomatics or doing some modeling and fitting the measures that we need directly to the problems rather than just sort of having some ad hoc measures that we think are natural ones because they sort of make sense in using those. So that's sort of, I think, a more long-term methodological thing that the literature needs more of that. And then I think one thing that we know very little about is how aware are people of their social environment. And if you do some simple introspection, there's a lot that you know about who are important people around you without actually having a lot of contact with those individuals. And so here what we've done is just have the simple model of gossip, which might be one explanation of how you can become aware of a person's position by doing a very boundedly rational, simple calculation. I just sit there and keep track of how many times I hear about it and think somebody's more important if I hear more often about them. I would actually be fairly good at predicting people if the news is coming in a random manner according to this kind of model. So there might be something as to, you know, just being a listener can help people know a lot more about their network without actually ever having seen the network. Okay, so, you know, broadly measured. So I think I just went through this. So thank you very much. Yeah, so actually what we're doing right now, so literally we've got the data back. So we've been going back in. We've gone through and used the people that they named to actually run diffusion. So what we did is we've now run a controlled experiment where we looked at diffusion central individuals. 
We've looked at people that they named, and then we ran diffusion of a new product. What we did is we gave out some cell phones and we told them there was an opportunity to come and win a cell phone. And so we have a... Do you want to do exactly the same thing you did with the diffusion measure? So for the individuals... Yeah, yeah, yeah, yeah. ...the banks actually spoke to... Right, right, right. What was the... Oh, the nominations. Right, right, right. Yes, yes. So it's actually looking at... Yeah, so... Look at the average nominations that the banks actually... Of the people that the banks had and look at that. Yes, yeah, yeah, yeah. Yeah, yeah, yeah, yeah. And I think, in fact, I'm trying to think if we have done that. I'm pretty sure we did and I don't have that here. But yes, it's not hard to do and I think it comes out the way you expect. But I can double-check on it. Well, the question is, would it be better or worse than diffusion? No, so basically... Yeah, so pure nominations is better or worse. Yes, so, yeah, I don't know. Yeah, yeah, yeah, yeah, yeah. I don't have any expectations of which way it would be. But the other thing is, if you're asking them a very particular... You're only asking them to name one individual, right? Right. We could ask what's the best set. Well, suppose they knew everything, right? Right. If the bank went in and asked just one individual, the bank has decided they're going to tell five individuals, so you want to go in and ask somebody in the village, who are the five most important people in the village? Right, and in fact, I think the way we suggested the bank do it is a little bit by snowball, too. Because once they're talking to somebody, they might as well have already informed that person. And now they can go off the people that that person names and then talk to those individuals. If people named twice, then... 
So the order in which they can go can be based on the nominations that they've got, the sequence of nominations, yes. And so asking for more than one is valuable as well. And here we just asked for one per recipient. But we could ask them to rank some, say, you know, he would give us a list of five. We didn't do that. So in terms of numbers, with the original question you asked, how many people do you talk to, some problem that, you know, there's some... not bad, but the number of people you've talked to, but the number of people that you list. I'm asking a question. Oh, sure, yeah. So in fact, the... How do you find 15 people in a difficult time? Yes, definitely. So what we had is... we gave them a cap of eight that they could name. I think out of the villages, you hit that cap of eight about one to 2% of the time. So people usually only name a few. And that's a problem with this kind of survey data. In general, you know, there's some fall-off. And it's very subjective as to how many people list. And so, you know, if you have data like e-mail data... So one advantage of having, say, e-mail data or something like that is you have a factual frequency of exactly who people are interacting with and how many times. And you can list everybody and so forth. There's advantages and disadvantages. The advantages here is we can ask a whole series of different questions and look across these different dimensions. And by putting them all together, we get a fairly good picture. And part of the reason that we ask 12 different questions is it's really hard for people to answer the question, who are your friends? Do you just go into that, who are your friends or who you usually talk to? You get sort of this weird list. And if you give them very specific cues, then they're better at sort of remembering, you know, oh, yeah, these are the people I go to temple with or this is the person that I, you know, get kerosene from. 
And so you give them a whole series of things that they tend to do during a day. And that gives them something more precise, and then when you aggregate that, the aggregate network does better than any of the individual networks. The best subset of networks turns out to be putting together the kerosene, rice, advice networks, borrowing networks, and the medical help network. So if you sort of work with those subsets of networks, those are the best predictors. And actually adding things like temple adds noise. So it decreases the, it just, yeah. So some networks are not very good. Yeah, who do you? Right, right. So the question was, who do you go to, temple is a bad word, but it was, I don't know the Kannada translation. So the way you double check the translations is you do reverse translation. But we asked, who do you regularly go to religious services with? It was a question that was designed to pick up all of them: there's Muslim, a small amount of Christian, and predominantly Hindu. Yeah, so I think, right, so I think, you know, what's probably true is, you know, there's some limit in how far out I can go. So the people that I tend to hear about that are indirect friends via Twitter and other kinds of things are going to be limited to sort of distance two or three. So, you know, you can get some limited information about who are people that are prominent in my one, two or three neighborhood, and finding the most prominent person in a large network of, you know, 700 million people on Facebook or something, that might be trickier. But just asking individuals, yeah, presumably the same technique should work in terms of, you know, who are the people, who are your friends that you hear the most about? And there's going to be a bias in terms of not and so forth, but you still have some, I think it would work well, possibly, there too. Yeah. Yes, very good.
So actually the paper, the paper that we have in Science, the 2013 paper is on that subject. So what we tried to do is disentangle to what extent it was positioned in a network and just information flow and to what extent it mattered, how many people I heard from and what their leadership status was and so forth. And it turns out in this particular application that it's just the news that seems to matter. And I think that that's easy to explain because if you're living in India, it's pretty hard to be ignorant of microfinance these days. In a remote village, it's been such the big news over the past decade that it's really hard to ignore. And so most people know that they would be interested in microfinance or not if they hear about it. And so the main information blockage was just getting the information out. But in other kinds of things where it might be, say, some insurance product that might be good for some people and not for others or something which, I don't know, a movie that some people might want to go to and other people might hate, then the particular identities of the individuals and of status and how similar they are to you could make a big difference. And I think you want to incorporate that into the modeling. Thank you. Thank you.