 for this meetup. My name is Prashant Bhattacharya. I'm a scientist at the Institute of High Performance Computing, where I work with the social and cognitive computing group. Very broadly, my work focuses on mining insights from social data across different kinds of platforms. So thank you, PyData Singapore, for giving me this opportunity to share some of my work with you. Hopefully this is going to be an insightful next hour or so. I have three broad goals for myself for the next hour. Feel free to suggest any changes. First, I want to introduce some basic concepts in social networks for those of you who are interested in getting into the field. I know some of you have prior experience working with social networks in your own applications and your own areas, but I'm going to give some very basic foundational knowledge to help those of you who are relatively new to the field. Second, I'm going to talk about some social network applications, particularly the ones that are of interest to me. This is by no means a representative list of the social network questions or problems that people are working on in the data science community. It is just a convenient subset of problems that I've been working on. And finally, I really hope to learn from you. We'll probably talk about this offline, but I'm really interested to know why you think social networks could be of some importance to the work that you do. So a question that I often get in conferences and seminars and, frankly, from whoever I talk to when I say that I work on social networks is this: yes, you have Facebook, Twitter, Instagram, but why do I need to care about social networks? Generally, I have this one slide that I show everyone, and I've been showing it more and more of late. If I had to end this presentation after one slide, it would be this next slide. 
And so this is very exciting for Game of Thrones fans. If you watch the show, you'd probably find this of some interest. This is a network of who betrays whom over the past six seasons of Game of Thrones. If you've been following the series, you know that betrayals are an integral part of this particular TV show. You have these different houses, and people betray each other to different degrees. The thicker the line, the stronger the effect of the betrayal. If you see a dashed line, it means the betrayal is implied but not proven. There's probably a betrayal, but we're not sure. If you want to read the whole blog, you can just Google Game of Thrones betrayals and get a nice social network. It's a GIF file that has nice animation. But when I was looking at this, I was thinking that you could go a step further and do network analysis on this graph to come up with models that predict where the next betrayal is going to come from. Who's going to betray whom? Who's going to kill whom in the next episode of Game of Thrones? People have actually tried to do that, unsuccessfully, if I might add. But this is a great example of what you can do with social networks just in terms of being goofy. But again, social networks are not new. This is not a George R. R. Martin creation. People have been doing social networks for a fairly long time. I would say that it really started back in the 18th century with Leonhard Euler and the famous Seven Bridges of Konigsberg problem. I don't know how many of you are familiar with this problem. This is considered to be the first theorem in graph theory. And the idea was that Euler used to live in this place called Konigsberg. 
And he had always had this idea in his head: can I start from a point, traverse the whole mainland and the islands through these seven bridges, and come back to my original point without having crossed any one bridge twice? So that's the problem. He thought about this and came up with a theorem basically saying that no, it's not possible. It's only possible to do that under a certain set of conditions, which he laid down. Today, when we study graph theory, we talk about Eulerian cycles and planar graphs, and we know the conditions under which such a walk exists. So this is probably an early example of when people started thinking about graph theory and social networks. But when it comes to actual data, and because this is a data science community you're interested in data, we had to wait almost a century, until the early 19th century. And the data didn't come from humans, interestingly. It came from bees. There was another Swiss naturalist called Pierre Huber who, in 1802, actually studied bees for a living. He wrote a big dissertation on different kinds of behavior among bumblebees, one of which was this idea of dominance patterns. Bees have very complex societies with very layered structures. Certain kinds of bees do work. Certain kinds of bees are involved in reproduction. Then you have a queen bee. Huber was trying to study dominance structures among these different classes of bees. And that is arguably the first time someone did a systematic study of social networks using actual observational data. So this is not new, and I'm not the first person to talk about this. Now, networks exist everywhere. You don't have to go back to the 18th century to talk about networks. If you have taken the MRT to come to this session, you've already traversed a network. 
And I keep making this point: whenever I see people browsing Facebook on the MRT, I think to myself that here's a person who is browsing an online social network while traversing an offline social network. A lot of people don't realize this. So MRTs are a great example of social networks. And even if you don't realize it, you're experiencing network effects every single day as you take the MRT. So here's a very nice infographic from Channel News Asia that I picked up. It talks about MRT delays. MRT delays are becoming a reality of life in Singapore, and this graph shows you which lines are more prone to delays. What it actually tells you is how many kilometers a train goes before it shows a fault again. You see the green line is particularly bad, so it breaks down pretty often. And the blue line, probably because it's a newer line, is much better. But the problem starts when delays in one line start spilling over into another. Because the MRT is a networked system, if you have delays or breakdowns in one line, it pushes people onto another line, and then it causes situations like this. So even though you might not be interested in studying networks, you're the unknowing victims of network effects in your daily life, and it's worth remembering that. What's also interesting, and something that I find particularly intriguing, is that social networks are universal. I would argue that there are very few areas of science that are as cross-cutting as social networks. You could be doing very different things but encountering very similar network patterns in your data. Let me give you an example of just how extreme it can be, with a bit of context behind what I'm going to show you next. There was a team of researchers at Sapporo University in Japan working with a team of researchers at Oxford in the UK. 
And they were trying to study how slime molds form food networks. Slime molds are unicellular organisms. They have no brain, just a single cell. They are well known for forming really good and efficient networks for transporting food particles within their cells. And the interesting thing about slime molds is that even though they're unicellular, you can see a slime mold with the naked eye if it grows big enough. So what the researchers did was sprinkle, I think it was oat flakes, on a Petri dish. Apparently, oat flakes are a slime mold favorite. And they let a slime mold grow. What you see on the screen, column-wise, is the network of the slime mold as it spreads and tries to connect food centers wherever it finds food, forming these food tunnels to transport food particles within itself. The photos you see are from 0 hours, 5 hours, 8 hours, 11 hours, 16 hours, and 26 hours from the start of the experiment. At the end of 26 hours, you see this very intricate but very clear network of food supplies that the slime mold forms. So this is within about a day. Now, what I did not tell you while explaining this experiment is the positions where the oat flakes were placed by the researchers. These were researchers in Japan who were intelligent enough to match the positions of the oat flakes to cities around Tokyo. The big spot in the middle was Tokyo itself, and all the other dots were cities around Tokyo. And what they found after 26 hours was that the network formed by the slime mold resembled, to a surprising degree, the actual Tokyo train network, which engineers in Japan had taken decades to build. And this was astonishing. This was published in Science, I think, back in 2010. 
How a single-celled organism using very simple heuristics can mimic what took engineers in arguably one of the most developed countries in the world decades to do is still a mystery. But this really shows the beauty of network evolution and how networks can show striking similarity across very different contexts. Here's another of my favorites. There's this website called Movie Galaxies. Check it out. What this website does is show networks of movie characters for different kinds of movies. You can enter your own movie and essentially see how the characters interact with each other. If there's a tie between any two characters, it means that the characters have interacted at least once. And the thicker the line, the stronger the communication. Why is this important? Look at this visualization. No prizes for guessing, but this is from Lord of the Rings. Any idea which one? Any guesses? Yeah, the first one. All right. So you can see that this is a multi-character movie. There's no one important protagonist. There are many important dots in this one. There's a very similar structure for this other one. Any guesses what movie this one is? The Godfather. You see there is clearly Michael, who's the protagonist, but there are also these other characters who are equally important. But there could also be egocentric movies. Even if you have no idea about the movie's cast and characters or the plot, you could look at this network and see that this is an egocentric movie. It revolves around one character. And this is, of course, Schindler's List. This one is Forrest Gump, again a very egocentric movie. If you have not seen the movie, you must. But this, again, is a very egocentric network, and just by looking at the network you can infer something about what's going on in the movie. And this one is somewhere in the middle. 
So you could have movies with two strong characters. This is The Bourne Identity. You have Jason Bourne on one side and Alexander Conklin on the other side, the protagonist and the antagonist in the same movie. And this is from Titanic, which, of course, had two characters, both of whom died. But then again, it's a duocentric movie with two very important characters. These are just some fun examples of what you can do with networks, even if you have no idea about the context. All right. So now that I hope I've sufficiently motivated you on why it's worth your time, let me give you some really boring details about how you work with network data. So this is network 101. Any complex system would probably have several components. You can think of each dot on the screen as a node. A node could be a person, a product, an object, anything. A society is composed of multiple individuals, so if you think of society as the whole graph, one node would be an individual. On the bottom left of the screen, I've added a code snippet. This is from the NetworkX library in Python. It's not my favorite library, but it's very easy to use, and it is probably good enough if you're working with small graphs. For larger graphs, it's probably better to use a library like igraph, for example, or graph-tool. But NetworkX is pretty handy, and you can see it's very easy to construct graphs, just a matter of a few lines. Once you have the nodes, you can also have edges, which are really relationships. The relationships could be friendship, kinship, co-authorship, romantic relationships, et cetera. It could be any kind of relationship between the nodes. One benefit of representing these relationships is that you can then ask questions about prediction. In this case, you could ask: who will Alice befriend next? Given that this is the current structure, who's Alice going to befriend next? 
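The slide's snippet is not part of this transcript, so the following is only a minimal sketch of what building such a graph in NetworkX might look like. Alice and Bob come from the talk; Carol, Dave, the `relation` attribute, and the helper `common_neighbor_candidates` are assumptions made for illustration. Counting shared friends is one simple heuristic for "who will Alice befriend next?", not necessarily the one the speaker has in mind.

```python
import networkx as nx

# Nodes are actors; edges are relationships between them.
G = nx.Graph()
G.add_nodes_from(["Alice", "Bob", "Carol", "Dave"])
G.add_edge("Alice", "Bob", relation="friend")
G.add_edge("Bob", "Carol", relation="co-author")
G.add_edge("Bob", "Dave", relation="friend")
G.add_edge("Carol", "Dave", relation="kin")


def common_neighbor_candidates(graph, node):
    """Rank non-neighbors of `node` by how many friends they share with it."""
    scores = {}
    for other in graph.nodes:
        if other == node or graph.has_edge(node, other):
            continue
        scores[other] = len(list(nx.common_neighbors(graph, node, other)))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


candidates = common_neighbor_candidates(G, "Alice")
print(candidates)
```

In this toy graph both Carol and Dave share exactly one friend (Bob) with Alice, so the heuristic cannot separate them. A real link-prediction model would bring in more structure or node attributes.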
In terms of Game of Thrones, you could ask, who is Cersei going to kill next? So you have similar questions about prediction. You could also have questions about explanation. Why did Alice befriend Bob? This is more retrospective. If I give you the data about the network and the behavior, can you uncover reasonable explanations for why something happened? That is the explanatory part. So here's a quick mental floss for all of you. If Alice starts to smoke, which is an observable behavior, what is the likelihood that Bob will start smoking too? This is an open question. You have to think about it. You could also ask why there should be a correlation to begin with. And if there is a significant correlation, how can we be sure that it's because of the network tie? It could be for various other reasons. And finally, a more practical implication: if we do find evidence of some kind of network effect, can we then leverage this knowledge to design an intervention to cure someone of a particular disorder or a particular addiction? These are sample questions that you would think about when thinking about networks. Back to network properties. As I said, you have actors, and actors have relations. Now, these actors and relations have different types. A very simple network could be an undirected and binary network. It is undirected because the edges don't have arrows, so the relationships have no direction. Think about friendship in the real world. It probably doesn't have any direction. But on Facebook, it does have direction, because you know who sent the friend request. So you can have a directed arrow from one person to the other. A network could therefore be directed and binary, and Facebook friend networks would probably fit into this category. Now, the edges themselves could also be weighted rather than binary. So instead of being zero or one, an edge could carry some kind of weight. 
Think about your communication network on your phone. Whether you call someone once a day or five times a day can be the weight on the communication network. And weights tend to be very important in a lot of predictive models. Finally, a network could be directed and valued. So it's a 2 by 2 kind of option: you can be binary or weighted, directed or undirected. Now, how do you store network data in terms of data structures? The simplest way of storing network data is through what's called an adjacency matrix, where you have the nodes on both axes. If there is an edge, you enter one. If there is no edge, you enter zero or leave it blank. This is for a binary adjacency matrix. For a directed network, as you can see, the matrix is not symmetric, so A to B is not the same as B to A. So this is probably the most familiar way of storing a network. But there is a big problem with this data structure. Any guesses? What's that? N squared? Yes, so it doesn't scale. When you have a really large network, this creates a big storage problem because you're using a lot of memory to store very little information about the network. Which is why you would probably store it as a sparse matrix in Python or R. Next is a slight improvement over the adjacency matrix, the adjacency list. You can convert any matrix to a list, and this is a much more optimized way of storing network data because you're only storing the edges. And finally, this is possibly the most common form of network data, an edge list, where each row holds one pair of nodes. You could have optional columns for the weights, and each row could have multiple weights, depending on how many kinds of relationships you're trying to model. There are also node-level features. I use the term attribute synonymously with feature in the machine learning sense. So you could think of centrality. 
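The three storage formats described above can be sketched in plain Python. This is an illustration only: the node names, the weights, and the helper functions `to_adjacency_matrix`, `to_adjacency_list`, and `to_edge_list` are made up for this example, with the weight standing in for something like calls per day in a directed communication network.

```python
nodes = ["A", "B", "C"]

# Edge list: one row per (source, target, weight). Directed and weighted.
edge_list = [("A", "B", 5), ("B", "A", 1), ("A", "C", 2)]


def to_adjacency_matrix(nodes, edge_list):
    """Dense n x n matrix; entry [i][j] is the weight of the edge i -> j."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target, weight in edge_list:
        matrix[index[source]][index[target]] = weight
    return matrix


def to_adjacency_list(edge_list):
    """Store only the edges that exist, grouped by source node."""
    adjacency = {}
    for source, target, weight in edge_list:
        adjacency.setdefault(source, {})[target] = weight
    return adjacency


def to_edge_list(nodes, matrix):
    """Recover the edge list from a dense matrix by skipping zero entries."""
    return [
        (nodes[i], nodes[j], weight)
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight != 0
    ]


matrix = to_adjacency_matrix(nodes, edge_list)
adjacency = to_adjacency_list(edge_list)
print(matrix)      # not symmetric, because the network is directed
print(adjacency)
```

The dense matrix always holds n squared cells no matter how few edges exist, which is the scaling problem raised in the talk. The adjacency list and the edge list grow only with the number of edges.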
Very simply put, centrality is about the importance of a node in a network. Degree centrality is essentially the number of connections that a particular node has. In a directed network you have an out-degree and an in-degree. The out-degree is the number of edges that flow outwards from a particular node. The in-degree is the number of edges that are incident on a particular node. In terms of the adjacency matrix, the out-degree of C is a row sum, and the in-degree of C is a column sum. You could have slightly more sophisticated centrality measures. Closeness is a very interesting measure that essentially captures the distance of a node from every other node in the network. Think of starting from one MRT station in the Singapore MRT network and trying to find the shortest route by which you can reach some other station on a different line. All of you navigate this question every day. What you're doing in your head is essentially computing the closeness centrality measure. In terms of the mathematical formulation, closeness is the inverse of the farness, where the farness of node i is the sum of d(i, j) over all other nodes j, and d(i, j) is the length of the shortest path between i and j. You take the inverse of that and you get closeness. Betweenness is probably the most computationally intensive metric to compute. Betweenness gives you a measure of how many shortest paths of the graph pass through the focal node. If you want to find the betweenness of a particular node i, you ask how many shortest paths in your network pass through i. If you think about how you would compute this, you very soon realize that it is a computational nightmare, especially for really large networks. So there are approximations to make this faster. It also helps if you have a parallelized version of whatever functions you're using to compute this. 
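To make these definitions concrete, here is a small pure-Python sketch. The two toy graphs and the function names are assumptions for illustration. Closeness is computed as the raw inverse of farness, as described in the talk, and betweenness is computed by brute force over all pairs, which is fine for five nodes but is exactly the kind of computation that does not scale.

```python
from collections import deque

# Directed toy network as an adjacency matrix (rows = source, columns = target).
directed_nodes = ["A", "B", "C"]
directed_matrix = [[0, 1, 1],
                   [0, 0, 1],
                   [1, 0, 0]]
out_degree = {n: sum(directed_matrix[i]) for i, n in enumerate(directed_nodes)}
in_degree = {n: sum(row[i] for row in directed_matrix)
             for i, n in enumerate(directed_nodes)}

# Undirected toy network as an adjacency list: A - B - C, with D and E hanging off C.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D", "E"], "D": ["C"], "E": ["C"]}


def bfs(adj, source):
    """Return hop distances and shortest-path counts from `source`."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time we reach v
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u lies on a shortest path to v
                sigma[v] += sigma[u]
    return dist, sigma


def closeness(adj, node):
    """Inverse of farness; assumes the graph is connected."""
    dist, _ = bfs(adj, node)
    return 1 / sum(dist.values())


def betweenness(adj):
    """Unnormalized betweenness over unordered pairs, by brute force."""
    info = {s: bfs(adj, s) for s in adj}
    nodes = list(adj)
    score = {v: 0.0 for v in nodes}
    for a, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[a + 1:]:
            if t not in dist_s:
                continue
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v is on a shortest s-t path if the two legs add up exactly.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


close = {n: closeness(adj, n) for n in adj}
between = betweenness(adj)
print(out_degree, in_degree)
print(close)
print(between)
```

For larger graphs one would reach for a library routine instead. NetworkX, for example, provides `betweenness_centrality` with a `k` parameter that estimates the score from a sample of source nodes.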
Finally, and this is probably my least favorite centrality measure, there is eigenvector centrality. The reason I don't particularly like it is that it's reflective in nature. What eigenvector centrality means is that the centrality of a particular node is a function of the centrality of all its surrounding nodes. So if you have really important friends, eigenvector centrality treats your importance as very high. Now, this has an inherent problem. Can someone guess what this problem might be? Think of it in your own social life. If your importance is measured by how many important people you know, why might this not be a very clever way of computing importance? All right, so I'll give you the answer. It's because, going forward, this is a reflective process. If in the first month your importance is a function of all your friends' importance, and you have really important friends, then even if you're not important, you get a very high score. Then in the second time period, since you're now very important, it feeds back to all your friends. All your friends' importance increases as a process of reflection. And this goes back and forth. Over time, you'd get a network where a few nodes have very high importance and the other nodes have very little. So it clusters the importance in very small portions of the network. Now, what this network shows you is that different nodes can have different levels of centrality. In this network, you have different nodes which have a higher degree or higher betweenness or higher closeness. So there's no one-size-fits-all explanation for centrality. Depending on what you're looking for, different nodes in a network could be considered central or not central. This is very important to keep in mind when you're working with real applications. 
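The feedback loop described for eigenvector centrality is essentially power iteration: every node repeatedly takes on the summed scores of its neighbors until the scores settle. Below is a minimal sketch under stated assumptions. The four-node graph and the function name are made up, and the update adds each node's own score (iterating with A + I rather than A), a common trick that leaves the ranking unchanged while avoiding oscillation on bipartite graphs.

```python
def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on an undirected adjacency list (shifted by the identity)."""
    x = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # Each node's new score is its own score plus its neighbors' scores.
        new = {n: x[n] + sum(x[m] for m in adj[n]) for n in adj}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - x[n]) for n in adj) < tol:
            return new
        x = new
    return x


# A triangle A-B-C with one extra node D that is connected only to C.
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
scores = eigenvector_centrality(adj)
print(sorted(scores, key=scores.get, reverse=True))
```

Here D has only one connection, but that connection is to the best-connected node C, so D's score is driven entirely by C's. That is the "important friends" effect the talk describes.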
So before I get into some open applications that I'm currently working on, is there a question that you're thinking about or something that you'd want answered at this point? Yeah, go ahead. You mentioned eigenvector centrality, but isn't that close to what Google uses for the PageRank algorithm? Yeah. So are you saying that because of Google's PageRank, this problem is already happening in the importance of websites? Well, PageRank and other kinds of centrality like Katz centrality are improvements on eigenvector centrality that try to correct for the reflection problem. They use a similar principle, but since they try to correct for the reflection problem through approximations, the problem is less acute than with eigenvector centrality. I don't think there are any real-world applications today that actively use eigenvector centrality. PageRank is much safer. Yeah. Right? Yeah? I'm sorry, could you repeat the last one? The repulsion strength? Yeah. Sure. That's a good question. So if I have to summarize your question, you want to know whether contemporary social network research takes different types of centrality measures into account together to come up with a more accurate measure of centrality. Yeah, I would agree. I mean, no one knows for sure what algorithm LinkedIn uses for computing its influencers. But as I mentioned, different members in the network could be central for very different applications. So having a one-size-fits-all approach to identifying central members in a network can be fatal. If you're thinking about influencers, you have to ask yourself the question: am I looking at influencers in terms of job seeking? Am I looking at influencers in terms of recommending new friends? In these two applications, you could have very different uses of centrality. 
So I would say, for example, that closeness centrality, the way I just explained it, would be particularly useful for recommending new friends. But if I'm looking at influencers in terms of someone who can provide me with high-quality content on a particular topic, I would probably go for someone with high betweenness, because this is a person who joins multiple clusters of information. So he or she is likely to have access to richer information than someone with high closeness. So yeah, I totally agree that you would have to factor in these different kinds of centrality measures in coming to a conclusion. All right. Any other questions? Go ahead. Well, one way to overcome this problem is, of course, to discount. You have to break the reflection process. If you influence your friends and your friends influence you, and this goes back and forth, you have to have some parameter or some way of probabilistically breaking the cycle, such that you influence your friends, but then only a subset of them can influence you, with a certain probability. This is a very naive way of doing it. But if you do that, it breaks the cycle of reflection and gives you a much better approximation of the centrality measure. Yeah, you could also normalize by the number of friends that you have. This problem would be particularly acute if you have a lot of friends. So if you normalize by the number of friends you have, you reduce the intensity of the problem. Yeah. All right. Yeah, go on. Yeah? The person would be a terrorist. I'm sorry, could you rephrase the question? How can you find a terrorist in a certain group based on the numbers that they have? Who's the most probable one? A terrorist. No. But other people have actually done that. With observational data, the problem with terrorist networks is that you don't have ground truth data. 
Because if you had ground truth, you would probably be able to stop the attack in the first place. But there are research groups that work on terrorist network data constructed from newspaper reports and articles, and they have been able to retrospectively predict certain kinds of attacks in certain places. So yes, there is definitely a use case for network analysis. It's just that network data for terrorists is so hard to come by. Yeah. Unlike Game of Thrones. All right. So let me talk very briefly about some open problems and open research areas that I'm working on. It's going to be very hard to follow the terrorist network question, so nothing as interesting as plotting terrorist networks. But before we get into that, I want to talk about some unique research problems that you would face if you're trying to study social networks, whether in a research setting or in industry applications. There are basically two broad areas of work with social networks. The first is about structure and evolution: understanding how certain kinds of networks evolve over time. The other is about processes, which means mechanisms. You can think about smoking as a behavior and ask the question, how does smoking spread in a network? That is about processes. But the big problem is that structure and processes are what's called confounded. By this, I mean that you cannot study one without controlling for the other. Think about it. If your friends influence your behavior, you have to ask yourself the question, why am I friends with this person to begin with? There have been studies in the past which showed that smoking can be contagious. If you have a friend who's a chain smoker, there's a very high chance that you'd start smoking too. But the critics of this approach would tell you that maybe you met the person at the smoking pit, and that's how you became friends. 
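The smoking-pit objection can be illustrated with a small simulation. This is a sketch and not something shown in the talk: the population size, the probabilities, and the function names are all assumptions. Smoking status is assigned at random and never changes, so there is no influence at all. Friendships are simply more likely between people with the same status, which is pure homophily.

```python
import random


def simulate(n=400, p_smoke=0.3, p_same=0.05, p_diff=0.005, seed=0):
    """Assign smoking at random, then form ties that favor same-status pairs."""
    rng = random.Random(seed)
    smokes = [rng.random() < p_smoke for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p_tie = p_same if smokes[i] == smokes[j] else p_diff
            if rng.random() < p_tie:
                edges.append((i, j))
    return smokes, edges


def share_of_smoking_friends(smokes, edges):
    """For smokers and for non-smokers, the fraction of their friends who smoke."""
    friends = {True: 0, False: 0}
    smoking_friends = {True: 0, False: 0}
    for i, j in edges:
        for ego, alter in ((i, j), (j, i)):
            friends[smokes[ego]] += 1
            smoking_friends[smokes[ego]] += smokes[alter]
    return {status: smoking_friends[status] / friends[status] for status in friends}


smokes, edges = simulate()
share = share_of_smoking_friends(smokes, edges)
print("smokers:     share of friends who smoke =", round(share[True], 2))
print("non-smokers: share of friends who smoke =", round(share[False], 2))
```

Smokers end up with far more smoking friends than non-smokers, even though nobody in this simulation ever influenced anybody. An analysis that ignores how the ties were formed would read that gap as contagion.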
And so if you ignore the network formation process, you might end up with very different conclusions about what's happening. The big problem in this area is that it's very hard to do experiments. As you would know, in industry and research the gold standard for understanding these things is doing experiments, doing A/B tests. But with networks, doing randomized controlled trials, or what's popularly known as A/B tests, is very tricky. I want to talk about that in the next few slides. Which brings me to my favorite research area, which is peer influence. I like to study influence processes in different areas. How do people influence each other for a variety of behaviors? So again, a little bit of math, just to give you the impression that it's something important. For a person i, if you intervene on some behavior, you probably observe a change in outcome. That is a very simple explanation of this one line. What I'm interested in is what happens if I intervene on the person's friends. How does that change the focal person's behavior? Let's think about your best friend, and say I intervene on some behavior of your best friend. Maybe I imprison him, and I try to see the change in your behavior. That's peer influence. Peer influence processes are everywhere. This is not something new. Peer influence has been studied in the adoption of brands or products. If you buy a new iPhone, how soon does your best friend buy an iPhone? That's peer influence. Think about the diffusion of innovations. There are villages in Southeast Asia where governments are trying to introduce microfinance programs, and they essentially rely on peer influence processes to diffuse these innovations in the communities. Then there is the spread of disorders. There are some studies that talk about how certain kinds of diseases or disorders, like eating disorders, can be contagious. And there have been criticisms of those studies as well. 
And then voting behavior. There are a lot of studies, for obvious reasons, that look at how the way your friends vote can influence your own voting patterns. There's a very famous study from Facebook which showed that, on election day, showing notifications of how many of your friends are going to vote can significantly change the chances of you going and casting your own vote. So by using these social cues they can actually increase voter turnout. And finally, content consumption. This is very common. Whatever you post or consume on social media is influenced by your friends. All right, but the big problem in studying peer influence is that it's always mixed up with what's called homophily. Homophily, very simply put, means birds of a feather flock together. If you're similar in certain respects, you probably hang out together. And this is a problem when trying to study influence. Over the years, people have observed that when you are in a relationship with someone, be it any kind of relationship, there's a high chance that the two of you will show very similar behaviors. But the question is, is it only because of influence? All right. So let me tell you a hypothetical story to argue against this. This is from one of my least favorite research papers, by Shalizi and Thomas. It's not my least favorite because it's a bad paper. It's actually a fantastic paper. It's my least favorite because it probably delayed my graduation by a year. So suppose you have two friends named Ian and Joey, and Ian's parents ask him, hey, if your friend Joey jumped off a bridge, would you jump too? The question the authors ask is, why might Ian answer yes? A simple answer would be that Joey's example inspired Ian, and this is influence, right? This is what all of you would normally think: my best friend influenced me to jump off the bridge. 
But there could be at least five other explanations for what just happened. Here is another one. Maybe Joey infected Ian with a parasite which suppresses the fear of falling. This is what's called biological contagion. It's as if he injected Ian with a particular virus that somehow messed with his mind. Another option could be that Joey and Ian are friends on account of their shared fondness for jumping off bridges. This is homophily on the focal behavior. Both of them, for some weird reason, like jumping off bridges, and it just so happened that they jumped off the bridge at the same time. Here's another one. Joey and Ian became friends through a thrill-seeking club whose membership you can freely see. This is also observed homophily, but on a different kind of behavior. They joined this club, and the club has weekly meetups, only the meetup topics are very weird. On that particular week, the agenda was to jump off bridges. So this is homophily, but based on a different behavior. It could also be a case of unobserved homophily, which we don't observe. A good example is that maybe both of them have an inner desire for thrill seeking which none of us know about and their parents didn't know about, and this inner desire drove them to jump off the bridge. And the last one: maybe both of them realized that the bridge is about to collapse and that jumping off might be the safer option. This has nothing to do with influence or homophily. It's just a situational factor that drove both of them to a similar action. So there are at least six reasons why certain behaviors could be correlated. How can you disentangle influence from the other five factors? This is something that I tried to address in my PhD, which is probably why it took a lot of time. So proving the existence of homophily and influence is not easy. Quantifying homophily and influence is probably possible from data in certain specific contexts. 
Again, not true in the general case. But isolating the effect of one from the other is extremely hard, and this is probably the central problem in social networks research. If you're interested in reading the paper, you can take a look. But this is not just a research problem, for those of you who might be wondering if it's just an academic problem. Not really, because uncovering influence is actually key to making the right managerial decisions. For example, if you are Apple and you're launching a new product, a new iPhone, and you're interested to know how it is going to diffuse through the network, you should have a good idea of how diffusion works. And diffusion is basically influence. Quantifying homophily is equally important. If you're Facebook building a new friend recommendation engine, it's in your best interest to understand how people form friendships on Facebook to begin with. Are they forming friendships based on similar age or similar gender, similarity in TV or movie viewing experiences, etc.? This is important. And there have been serious public policy debates all over the world based on this fundamental conflict. So, is smoking contagious? How many of you here think that smoking as a behavior is actually contagious? Sure, fine. And how many of you think that it's not contagious, that it's individual choice? So it's evenly split. So you see why this is a serious public policy debate. How about adolescence? Maybe this is contagious for a certain age group but not contagious after a particular age. In a different setting, think about e-learning or education. Does encouraging group-based learning improve learning outcomes? When you're learning with your friend, do you learn better? How must the groups be formed? There's actually a very interesting paper where they did an experiment on group formation, where they tried forming groups randomly versus pairing the smartest kid with the not-so-smart kids.
So in every group you had a really smart and a not-so-smart kid, and they tried to see if that was a better configuration for group formation. I won't tell you what the result was; you can go and check it out. But any guesses? Which group do you think performed better? The first one? The second one? How many of you say the first one, that the random groups learn better? And how many of you think the second one performs better? You can go and check out the study. You'd be in for a surprise.
Okay, so let me give you a real-world example from a study that actually did solve this problem fairly well, to a large degree. This is from Facebook. These are researchers at Facebook, and they were testing their social advertising feature. This was back in the day when social advertising was a new thing. The idea of social advertising, for those of you who are not familiar, is to show ads on Facebook which have social cues. They tell you that so-and-so has also liked this particular page. This is a social cue, and they try to see if showing you a social cue increases your click-through rate for a particular ad. So these were the two ads that were shown to different groups of people. For the ad on the left, it says 350,000-odd people like this, but it's a generic ad. They don't tell you if there is a friend who has liked that page or not. For the ad on the right, they tell you the name of a specific friend from your friend network. This is a social cue, and the variable D essentially tells you the number of social cues that are shown to you. They wanted to see in which category the number of clicks or the number of likes for the page is higher. They also wanted to see if there is a marginal peer effect, so not just from 0 to 1, but does it also increase from 1 to 2 to 3.
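The comparison described here, like rate as a function of the number of social cues D, can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the researchers' analysis: the function names and the toy impression data are made up, and the numbers are not results from the Facebook study.

```python
from collections import defaultdict

def like_rate_by_cues(impressions):
    """Return {d: like rate}, where d is the number of social cues shown."""
    shown = defaultdict(int)
    liked = defaultdict(int)
    for d, did_like in impressions:
        shown[d] += 1
        liked[d] += int(did_like)
    return {d: liked[d] / shown[d] for d in sorted(shown)}

def marginal_lift(rates):
    """Change in like rate between consecutive cue counts (d-1 -> d)."""
    ds = sorted(rates)
    return {d: rates[d] - rates[prev] for prev, d in zip(ds, ds[1:])}

# Hypothetical toy data: (number of cues shown, whether the user liked the page).
impressions = (
    [(0, True)] * 10 + [(0, False)] * 990
    + [(1, True)] * 15 + [(1, False)] * 985
    + [(2, True)] * 18 + [(2, False)] * 982
)

rates = like_rate_by_cues(impressions)   # like rate for D = 0, 1, 2
lift = marginal_lift(rates)              # marginal effect of each extra cue
print(rates)
print(lift)
```

A positive entry in `lift` for D = 1 corresponds to the basic peer effect, and positive entries for D = 2 and beyond correspond to the marginal peer effect mentioned above. A real analysis would also need random assignment of D and confidence intervals.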
So you see, for the first one they only show you one friend, in the second one they show you two friends, and in the third one they show you three of your friends. All right, so here are some results. If you see the variable Z on the top, that tells you the number of friends in your friend network who have liked that page. If Z is 1, it means just one of your friends has liked that page; for 2 and 3, it means two and three of your friends have liked that particular page. Now, you see, in these three conditions D is 0, which means no social cue was shown, but even then you see an increase in like rate for Z equal to 2 and Z equal to 3. This means that even if no social cue is shown, if there are more people in your network who have liked that page, there is a higher probability that you will like the page as well. This forms the basis of targeting, so this is how you target ads, and it sort of shows that there is homophily: if you have more friends in your network who like a certain thing, then, since you are friends with those people, there is a higher chance that you like those things as well. But then they also show that there is influence. For each of these groups, when you actually showed one social cue, you got a further lift in the page like rate, which means that after controlling for the number of people in your friend list who have liked that page, showing a social cue always helps, for each group. This was fantastic, because in one study they showed that there is homophily and also influence, and they could quantify both. Then they took it further by increasing D from 0 to 1 and then to 2 and 3, and showed that the like rate further increased. So this is conclusive evidence that peer influence works in social advertising. So this was from Facebook.
But there are some problems in doing these kinds of studies. The first problem that comes up is in sampling individuals during randomization. Now, the example that I just showed you was easier because you are just showing a social
cue, but you could think about other kinds of A/B testing applications in your areas where, if you have a network of people, how do you choose which nodes to put in the treatment group and which nodes to put in the control group? And why might this be a problem? Think about Skype wanting to roll out a new feature, or actually, in this case, think about Facebook launching a video chat feature. It randomly assigns the feature to a certain set of nodes. Now what happens if none of the friends of the node that was treated are also in the treatment group? If I don't have any friends who have also been treated with the feature, who do I chat with? So this is a problem. You can't just blindly randomize nodes in a network for an A/B test.
The other problem is about network interference, and this actually happened at my workplace a couple of months back. We have a nice sprawling space where we have our individual cubicles, as do most of you, I presume, and we are working, and interestingly we get our internet from three different access points. We generally connect to one or the other randomly, depending on our mood on that day. There's no particular reason why anyone connects to a particular access point. On that particular day, what happened was that one of the access points went down, so it cut off internet access for a random sample of people. You would assume that the people whose internet got snapped would have nothing to do, so they would just stand up and roam around, because nowadays what can you do without internet? And the rest of us who had internet would be focusing on our work as usual, right? But after a while, when I stood up and looked around, I saw everyone was roaming around. So what happened? Well, the people who lost their internet connections actually stood up and went to chat with their peers who still had internet, and in a matter of time everyone behaved like they had no internet connection. So this is a good example of
what's called network interference, where even though you might not be treated with a particular treatment, just because your friends were treated, it spills over to you. This is a problem in networks. And finally, the other big problem is how do you preserve network structure? If you have a treatment group and a control group in a social network, how do you make sure that the underlying structures are similar? This becomes a problem especially when you have structural parameters in your model.
Right, again, back to Facebook. Say Facebook had to introduce a video chat feature, and let's just assume this is the Facebook network. The actual network is slightly bigger than this. Slightly bigger, yeah. The way you do this experiment is you would place a coin on every individual on Facebook, you toss it, and depending on whether it's heads or tails, you would assign the user to the treatment or control group. That's how you randomize, right? But let's say this is how you do it. The users where you see this nice smiley face are the ones who have the video chat application, and the other users are the ones who don't have the application. Now this creates a problem, because there might be some users, especially towards the leaf nodes, who might be surrounded by users who don't have the application. So whom do they talk to? So what's the perfect way of doing this experiment? The perfect way of doing this experiment is if you had a parallel universe in which you had the same network at the same time. In this universe you treat everyone in the network, and in the parallel universe you leave everyone as is, and then over time you see the change in outcomes for the treated group versus the control group. But until parallel universes become a reality, you'd have to make do with something more feasible. So what's a possible solution to this problem? Any guesses? You can be as weird in your answer as you want, because this is an open problem. Was that, do it at two
different times? Do it at two different times, all right. So that's one way of doing it, but you don't control for any factors that might be time-varying, if there's some factor that changes over time. But that's a good option. So what he said is, I could do the treatment and control for the same users but over different periods in time. All right, not the best way, but this is definitely one possible way. Any other crazy ideas? So you could also do some kind of edge randomization to solve this problem. Think about it: this is an actual feature, right? So if you randomize edges, there might be a chance that you create edges between people who don't really have an edge in the real world. So that's a possibility here, but in theory, yes, that's one way of doing it. Do it in small groups and clusters, find similar clusters and try to randomize between the clusters? Yeah, exactly, that's how it's done. So I'll talk about graph randomization, but before I get into that, there's a really simple way of doing it that most tech companies generally follow. What is that? Sorry, someone said geolocation? Release by location? Yeah, pretty much.
So this is the most preferred strategy in industry. It's called the New Zealand option. If you've noticed, Facebook is particularly notorious for releasing all its applications in New Zealand first and in the rest of the world after that. And the reason it does that, and the director of engineering actually said it himself, is that the users there don't have many international friends, so it works well as a very nice and isolated treatment group. Study after study has shown that the Kiwis don't really talk much to their international neighbors, maybe to Australians but not beyond that, so they serve as a natural treatment group for the rest of the world. So this is definitely one way. This is nice, but then it has the other limitation that it's still two groups. It's New Zealand versus the rest of the world, and again
this is particularly convenient because it's a good enough sample to have statistical power. It's like 4.5 million users, English speaking, so it fits in perfectly with being a guinea pig for doing experiments. But it's still two groups. So the other possible option, which Facebook at least does, is, I think, what you mentioned: you do graph cluster randomization. What you do in graph cluster randomization is you try to partition your network, through some way of partitioning, into different clusters, and then assign each cluster to a treatment or control group. One way of doing the partitioning is natural partitions, so New Zealand. But then you could also think about algorithms. There are graph cutting algorithms and clustering algorithms that you can use, like community detection or label propagation, and using these algorithms you can form clusters, and then you can randomly assign clusters to treatment or control. This again has its own share of problems. For example, there might be some clusters towards the boundary and some clusters more in the interior, which might have differences. But this seems to be the state of the art in doing A/B testing on social networks, doing graph cluster randomization. For the study that I just mentioned about chat applications, I think Facebook actually did 6,400 clusters for just their English-speaking American sample. Using the algorithm they were able to generate these clusters and do the randomization. For more details you can look at the paper listed here. It's by this guy called Johan Ugander. He's at, I think, Stanford, and he does a lot of very interesting work on graph cluster randomization, so you should look him up.
So in case you cannot do experiments and you're only left with a huge data set retrospectively, what can you do with it? Can you model the counterfactual using structural approaches? This is sort of similar to the edge randomization idea that you're talking
about: if I have data, can I simulate the treatment and control groups through some probabilistic models? Again, not ideal, but this is definitely possible. You could think about propensity scores. Some of you here are from biostatistics or have some stats background, so you would know about this method called propensity score modeling, where essentially you model each individual's probability of being in the treatment versus the control group. This is particularly useful if there is some reason why people might self-select into certain groups. You could use Bayesian or graphical approaches. This is particularly common among computer science researchers, where you could think of graphical models to jointly estimate different kinds of interventions and their effects. You could also use variants of random graph models, and this is probably closer to my area of work, where I try to construct random graphs and try to see how they evolve over time following certain kinds of rule-based evolution. But all these methods are highly assumption-intensive. You need to follow a lot of steps for these algorithms to actually be valid in the real world.
A nice trade-off between actual experiments and observational data is what's called natural experiments, which, as the name suggests, is a way in which you let nature or God sort of partition the treatment and control groups, and then you opportunistically use the setting to test your theories. A good example of natural experiments is natural disasters. My advisor has a paper looking at the effect of hurricanes on the change in social networks in the US. As you know, hurricanes strike a particular geographical area, so you could treat people in that area as the treatment group, and you can think of similar people in other areas, neighboring states and neighboring cities, as your control group. So it has a natural way of splitting your users into treatment and control, and then you can observe different changes in behavior between
these groups. Other ways of studying natural experiments could be migration. The study here by Munshi is about Mexicans migrating to the US for work, and how, over time, Mexicans in the US versus Mexicans in Mexico form two different groups, based on which you can do different kinds of studies. And group assignments: if you are studying a particular module and you have groups formed based on different criteria, you have this natural way of studying different group outcomes based on different treatments.
All right, so let me give you a quick teaser of some of my ongoing projects, and if you are interested in any of these, we can talk more offline. This is a study from my thesis. It's under revision, actually, in a major journal. Again, the problem I am trying to solve here is very simple. Here is me on Facebook. I am trying to post a lot of content, and let's take, for the sake of simplicity, that I have three other friends on Facebook. All right, yeah, they look like those people, not exactly them. They are all producing a lot of content every day. They are always in the news, so they are posting a lot of stuff on Facebook and Twitter every single day. The fundamental question that I am trying to answer here is, when these individuals increase their rate of posting, what happens to my rate? Do I increase my rate of posting, because maybe I am encouraged or motivated by the supreme leader and I want to be like him, so I increase my rate of tweets and Facebook posts as well? Or maybe I decrease it, because I am really sick and tired of reading all these Facebook and social media posts, and I get so exhausted that I say, no, to hell with it, I am not going to post anymore? So these are competing theories. We partner with a major social network site based in the US, so we have a big data set over four years, and without the traditional experimental methods we are trying to tease out which theory plays out in the real world. But there are three problems. Social media posting,
unlike smoking, is a fast-changing behavior. You could be posting a lot on weekends, but on weekdays you might be posting just once in five days. So it's a very inconsistent behavior, unlike smoking, which is consistent. Link formation is endogenous, so there might be preferential attachment. I have to ask myself the question, why am I friends with Donald Trump? Does it signal some kind of underlying similarity that might be at play? And experimentally manipulating peer behavior is very hard, and I spoke about this a few slides back. I can't force Donald Trump to increase or decrease his tweets so that I can study the effect on others. So I have to rely on retrospective data to work through these problems. This is just one example of a project that I am working on.
Now something more business-friendly. We are looking at trying to predict creditworthiness in emerging communities, particularly with microfinance initiatives. As you would know, financial exclusion is a big problem in developing countries. There are millions of people with no access to bank accounts, and because of that they cannot get loans, they cannot get credit from these banking institutions. However, these countries have very high smartphone penetration, in some of these countries more than 80%, which is pretty high. So this gives us a very interesting opportunity: how can we leverage mobility and network data to empower these institutions? That's sort of the driving motivation for some of my recent work. We collaborate with a south station based mobile microfinance company who have been kind enough to share their data with us. Our insight from the study is this: we hypothesize that people who default on their loans might have a different mobility pattern before and after they draw the loans, as compared to people who pay their loans on time. And we actually do find that. We find that people who default on their loans, and this is just an illustration, this is not the actual results,
but it is similar to what we find. We find that once we factor the locations that these people visit before and after they draw the loans into our predictive model, we get a huge lift in accuracy in predicting their default behavior. This is very interesting and intuitive at the same time, because you would assume that people who draw really big loans might end up spending it on vices. They might go gambling, they might go to fancy restaurants, and this gets captured in the mobility behavior. So this is one interesting use case of network data that we are looking at. The network angle in this project is that visiting different locations might also be attributed to your network. You might be going to a restaurant not because you like that restaurant but because your friend dragged you there. So we factor in the network among the loan applicants to see how mobility and networks play together to help us better predict outcomes.
This is the more recent work that I am working on, trying to understand whether this behavior is contagious, like a disorder. The question that we are asking is, is there a network effect in loan default behavior? Simply put, is my loan payment behavior influenced by the payment behavior of my immediate peers? If all my best friends are paying their loans on time, does that somehow influence me to pay my loans on time, or to default? If yes, then does that also depend on my credit risk? Credit risk is about an individual's ability to pay, so whether the bank thinks that you are a high-risk person or a low-risk person. Are high-risk persons more susceptible to peer influence than low-risk individuals? This is from the actual paper. This is ongoing work. This is just a visualization of a low credit risk sample, where the red circles are people who paid their loans on time and the blue ones are people who defaulted. So you see, this probably makes sense: if you are a low-risk individual
there is a high chance that you pay back your loans on time. And this is for a high credit risk sample. You see a lot more blue dots, so a lot of people in this sample actually defaulted on their loans. Now, for both of these sub-samples, we are interested to see if there is evidence of contagion in the payment behavior, and this is what we found. For the first sample, the low credit risk sample, we find that most of the users are clustered near zero, so there is actually very limited evidence of contagion. But very interestingly, for the high credit risk sample we see these two peaks on both sides. So we see evidence of influence, but for both positive influence and negative influence, which means that high-risk individuals are susceptible both to positive payment behavior and to default. What this means is that high-risk individuals basically react more strongly to both good peers and bad peers, which again is very interesting for microfinance institutions to learn.
Finally, and I'll probably leave you with this slide, one future direction for social network research is in understanding if we can go from peers to structures and topologies. Do network structures matter? So far I've been able to sort of argue the point that what you do matters and what your friends do matters. But what about the interrelationships among your friends? Is that important? So you have two networks having the same set of nodes but a very different network structure. The second network has a larger number of cycles, higher clustering. Does a highly clustered network influence your outcome more than a less clustered network? And for which kinds of outcomes is this the case? And again, what about incompletely observed networks? I think if any of you work on social network data from real-world platforms, you're very familiar with incomplete networks. You probably know details about some users, but you'd be completely unaware of some other users who are connected to
your focal user. How do you deal with networks that have a lot of incomplete data? This is again a very active area of research. So my one-line summary for the whole presentation, kind of like a takeaway line, is: what can we do with incompletely observed egocentric networks? You only know the network for a certain sample of users and not for everyone. With an incomplete egocentric network, how can we leverage this kind of data for predictive as well as explanatory applications?
All right, so this is really what I had for you. I'm really interested to talk more offline, especially if you are working on networks in the context of urban systems. You could be thinking about network resilience of MRT networks or power grids, or adoption and diffusion of different kinds of sharing economy applications. You could be working in fintech, very similar to what I just explained. You could be thinking about network analytics applications in credit scoring, about how you profile loan applicants based on their social network data. You could be working on healthcare and understanding how healthy behaviors diffuse through a network. You'd see fitness applications: how do fitness applications gain from peer influence in networks? You could also try to think about how online health communities evolve as a result of these interactions among their actors. And finally, you could be thinking about e-commerce businesses and trying to understand how your friends might be influencing what you buy online. You'd see e-commerce sites have social cues as well. They try to tell you what your friends are buying in the hope that you would make similar purchases, or not. So these are some applications that I could think of, and if you're interested in or working on any of these applications and want to talk more about networks and what we could work on together, I'd be very excited to talk to you. Thank you very much for your attention, and reach out to me if you have any clarifications, comments, or
just if you want to chat. Thank you very much. Any weird, out-of-the-box questions?
How would you say Facebook data compares to real-life data?
I actually get this question a lot from my friends in sociology. When I tell them that I study social networks, they always tell me that I don't study real networks, these are not real people, real networks are people you meet offline. And I always tell them, well, yes and no. I mean, if there is a person who spends 16 hours on Facebook every day, how do you say that's not real and the person you meet offline is the real version? So there is no right or wrong answer to what you just asked. It depends on the question you're trying to answer. If you're trying to answer a question that demands data from offline networks and the data you have is from Facebook, then you're going to end up with very wrong conclusions. And vice versa: if you're trying to predict behavior on Facebook and you ask people offline, you're going to end up with very inaccurate results. For example, and this was an actual study, thanks for asking that question, again from Facebook, researchers asked people, how many friends do you have in your Facebook network? And they came up with a heuristic, some number which they thought was correct. Then they asked them, how many of these friends do you think read your last status update, the last update that you posted? And then they went back and verified the answers against actual Facebook server logs to see if the users underestimated or overestimated the number. What's your guess? Do you think users underestimated the number of friends who read their posts, or overestimated? How many of you say underestimated? And how many of you say overestimated? Right, so the majority of you feel that users tend to think that everyone's reading their posts. Actually, no. The study found that individuals systematically underestimated the
number of people who read their last post, because they have no idea. There are a lot of studies in social psychology that talk about this spotlight effect, where you generally think that everyone's looking at you. But in this particular study they showed that on Facebook a very different kind of social psychology plays out. It's the opposite of the spotlight effect: you think that fewer people are looking at you than is actually the case. So between the real world and the virtual world, even social psychology theories can be completely flipped. So that again depends on what you're really looking at.
Yes? Why do I not work for Twitter? Oh yes, yes, that's correct. Yeah, I haven't, but there have been researchers who have studied systematic differences between Facebook, Twitter, MySpace, which was there before Facebook, and there was one of the Japanese social network sites, what was the name? Yeah. But there have been studies that have looked at cultural and also behavioral differences in how people behave on these platforms. You can go and search on the internet and you'd find many such papers. But this is mainly done on student samples, with adolescents. I haven't seen a lot of studies done on mature populations.
Yeah? Yes, right. So my preferred social network library would be igraph, because I generally deal with larger networks. But if you're looking at smaller networks, you could use NetworkX in Python, or you could use sna in R. If you're looking at visualization, I generally don't like the visualization tools in R or Python, so I sometimes use this open source tool called Gephi. You can download it, and you can integrate it with Python and R: you can just dump your network as GraphML, or as an edge list, or as an adjacency matrix, and import it into Gephi. But in terms of analysis, for dynamic networks over time, there are packages for dynamic
networks in R that I generally use for my research, plus visualization. If you're using D3.js or any kind of JS-based visualization, you could use Python or R, because they have libraries to integrate with JS. But if you want something more point-and-click, you could use Gephi. That's for the visualization part. That's sort of my pipeline.
Yes, sure. There was one other, yes?
Just curious, is there any similar characteristic in terms of their network characteristics?
Yes, I mean, a very simple answer could be that they have high degree centrality. If you have important personalities, they would have many followers. But the other big problem in networks is the idea of bounded rationality. You might have a sense of how many friends you have, but you don't have a good sense of how they are connected, right? You might have a heuristic, you might know that your best friends are connected to each other, but beyond that you have no idea. So it's hard to say. There have been studies looking at networks of politicians and trying to see if there is any structural difference, but again these networks are egocentric networks, so it's incomplete data. A company like Facebook could probably do a more detailed study because they have the whole network. The one interesting difference that I can think of right now is this famous small world study that Facebook does from time to time. You can get your own small world number. The idea of small world is, starting from you, in how many hops can you reach every other person in the Facebook universe? And they try to show that really important people have very small small-world numbers. So a person like Mark Zuckerberg has, I think, about three at this point. The original study by Stanley Milgram had actually shown five point something to be the number. What's that? Five point something. And Facebook has been doing the study every
year, and it's been going down each year, and they use this as a nice piece of PR to show that Facebook makes the world a smaller place. So yeah, there's a structural difference as well. It's sort of related to closeness. But again, this is a study that only Facebook can do.
Yeah? How would one go about getting data, if one is interested in doing one of the studies you mentioned, like the credit rating part? I have a similar idea, but I struggle to even figure out how, as an entrepreneur or a smaller firm, to go about doing some social network analysis.
So based on my limited experience, I can tell you that if you want to do a really good job with any kind of social network research problem, you need a very good data set, and I'd say that data set would probably come from a collaborator. You would have to do a project with an industry partner who has access to the data. Back in the day they used to do network studies with surveys. They would administer survey questions to individuals and ask things like, can you tell me six of your closest friends? That worked well for some time, but it's not feasible anymore. So you need a collaborator who is willing to work with you on such problems.
All right, do you have any other questions? Or we could talk offline. I'll stick around for a while. All right, great. Thank you very much.
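For readers who want to experiment with the graph cluster randomization idea from the talk, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Facebook's procedure: the toy network, the hand-written cluster labels, and all function names are hypothetical. In practice the clusters would come from a community detection or label propagation algorithm, as mentioned in the talk.

```python
import random

ARMS = ["treatment", "control"]

def cluster_randomize(clusters, seed=0):
    """Flip one coin per cluster; every node inherits its cluster's arm."""
    rng = random.Random(seed)
    arm_of_cluster = {
        cluster_id: rng.choice(ARMS)
        for cluster_id in sorted(set(clusters.values()))
    }
    return {node: arm_of_cluster[c] for node, c in clusters.items()}

def node_randomize(nodes, seed=0):
    """Naive baseline: flip one coin per node, ignoring the network."""
    rng = random.Random(seed)
    return {node: rng.choice(ARMS) for node in sorted(nodes)}

def same_arm_share(edges, assignment):
    """Fraction of edges whose two endpoints landed in the same arm."""
    same = sum(assignment[u] == assignment[v] for u, v in edges)
    return same / len(edges)

# Hypothetical toy network: two tight friend groups joined by one bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
# Hypothetical cluster labels (assumed to come from some partitioning step).
clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

by_cluster = cluster_randomize(clusters, seed=1)
by_node = node_randomize(clusters.keys(), seed=1)

print("cluster-level:", same_arm_share(edges, by_cluster))
print("node-level:   ", same_arm_share(edges, by_node))
```

With cluster-level assignment every within-cluster edge connects two users in the same arm, so a treated user always has treated friends to video chat with; only edges that cross cluster boundaries can mix arms. Node-level coin flips give no such guarantee.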