Please welcome Conrad Lee, who will be holding a talk on "Privacy Invasion or Innovative Science?" A round of applause.

Okay, thanks for the introduction. Can you hear me in the back? Is it okay? Okay, thanks. So I'm going to put into question the current privacy stance that many academics have towards public social media data sets. I want to make two points. The first is that privacy issues are preventing a leap forward in the study of human behavior, and more specifically, they're doing so by preventing the collection and public dissemination of high-quality data sets. The second point I want to make is that in many situations the current behavior of academics in sharing these data sets is, I think, too sensitive to privacy. And this is because privacy on many of these sites is an illusion, and it's not an illusion that academics need to propagate.

Okay, so I'm not a privacy researcher. I use social media data sets to research how things spread through social networks. So this isn't a theoretical talk; I'm just going to outline some of the issues I've faced in my own research.

So here's an outline. First I'll motivate the whole thing: why these social media data sets are interesting and how they will help us make new discoveries about human behavior. Perhaps the most interesting data sets would be those that could come from Facebook, so I'll go over why that's the case. There was an attempt to create a very interesting data set from Facebook called the Tastes, Ties, and Time data set, and I'll go over that and why that project sort of failed in the end. To put the whole thing into context, I also need to introduce the evil twin of Tastes, Ties, and Time, the Facebook 100 data set. Then I'll put into question the current attitude towards sharing data from social media sites, and I'll end with just one slide on some more questions that originate from the ability to infer user attributes based on a social network.
So even if you don't tell a social network something, even if you leave many profile fields blank, these can be filled in, and this raises additional privacy concerns.

Okay, so sociologists at Harvard said in a New York Times interview that we're on the cusp of a new way of doing social science, that our predecessors could only dream of the kind of data we have now. And why is that? There are some stubborn questions that sociologists haven't been able to answer, such as: do tastes or beliefs determine who you become friends with? Or is it the other way around: do our friends influence us, and do they determine our tastes in a way? Are there opinion leaders? Is there a small minority of people who are responsible for convincing the people around them of what they should think? And there are many other questions to do with contagion and diffusion in social networks, such as: is obesity contagious? This was a recent result that drew a lot of media attention: that if your friends become fat, you're at a higher risk of becoming fat, too. These questions are increasingly studied not only by sociologists but also by physicists and computer scientists, who use models from their fields to try to answer them.

Okay, and the reason we're on the cusp of this revolution is not due to new methods (well, also due to new methods), but mainly due to the data that's available. The data on many social media websites is especially interesting for three reasons. You have a social network, this interaction network. On Facebook, you can think of the friendships. On Twitter,
there's a following relationship. So many of these social media websites include the ability to interact with users, and that's all recorded. And crucially, this network data is revealed, or observed, rather than obtained through a survey. Sociologists have been looking at social network data for a long time now, but traditionally the social network would be based on a survey, and it turns out that if I ask most of you here who the 30 people you interact with most are, you wouldn't provide a very accurate list. It would differ from the list you'd get if someone actually followed you around and looked at who you interacted with. Your cell phone service provider will maybe actually have a better idea; they'd be able to provide a more accurate list, or Facebook, for example, could. Then, of course, this data is on a much larger scale than has traditionally been possible to collect. And finally, the data is collected over a long period of time. All these features are important for answering these stubborn questions that have been posed for some time without much progress.

So where is the data? At this point social media websites have been popular for a decade or so, or nearly, and there still aren't carefully curated data sets for researchers. Data reuse is still the exception rather than the norm. Often the data is shared for a little while and then it's pulled down at the request of a site like Twitter. This leads to many problems, and this is what's really holding back progress towards answering these questions.

The first and most major problem is that of replicability. I mentioned before this recent finding that obesity is contagious across friendship ties. This finding is highly contentious, and there are economists who have shown this probably isn't the case. But the whole thing can't really be settled, because the original data set this was based on, which in this case wasn't a social media data set, but the same would also be true for many
findings based on social media, couldn't really be tested. It's hard to refute an idea if you don't have access to the original data. Also, it's very hard to build up incrementally as a field without these shared data sets. You can't say: well, this person discovered this about this data set; building on that, building on the communities or the structures that they found, we're going to take it a step further. And finally, it's very inefficient for every research group or every PhD student to have to write their own scrapers and figure out how to find loopholes in the APIs of these sites to collect the data that they need. It's also inefficient in the sense that if you want to collect a data set properly, respecting the privacy of users, there's a lot of extra overhead that many research groups won't have the resources for. Or they might be in a rush, and so they don't respect all these issues. So it would be more efficient if we just collected one data set properly.

The obstacles to that are obviously the ethical questions, of privacy basically, and the other obstacle would be the threat of service providers. I'll show later on that in some cases people's privacy has already been compromised, so the ethical issue is no longer really the main issue. The main issue is that a site like Twitter says "take this data set down," sort of threateningly.

Okay, and many research groups do share data sets. It does happen, but they're lacking. There are some sites, you can think of Flickr as an example, where there is an interaction network, it is longitudinal, it's large-scale, it has many of the properties I listed before. But the data set's not very interesting in general because it doesn't capture people's social universe very well. Their most important relationships.
Their best friends and their family won't be represented in this interaction network. Ideally you would have a data set from something like an isolated village of people who all had smartphones and all used the social media site, so you could trace changes in user behavior back to your original data. So if someone changes their political opinions, you can find the cause for that change within the data set that you have. So of course there are shared data sets, but there hasn't been one that's sort of canonical, a very important data set of general use. Well, actually there has been one that was collected, this Tastes, Ties, and Time data set, which I'll go over now, but for privacy reasons it is not being distributed publicly.

This data set was from a relatively self-contained social group. It was collected on freshman students going into a US private university, where people would come from all over the country, sort of sever their existing social ties, and be living in this little isolated college community. Another nice aspect of this data set was that it was from Facebook, and college students at the time used Facebook very heavily. So if you were friends with somebody in college, it's very likely that that friendship would show up in this data set. And the group that collected this data had the resources to manually annotate it, to get people to look at each profile and say: well, if this field wasn't filled in, if their gender wasn't filled in, they can look at the picture and tell what the gender is. So they manually enriched this data set.

Okay, and this Tastes, Ties, and Time data contained information on gender, socioeconomic status, race, and academic major, things sociologists believe play an important role in many of these fundamental questions. It also included information on what people's favorite books were, their favorite bands, films, etc.
So in many ways it was an unprecedented data set. It was collected from a small university in New England, it was collected over four years, and according to the terms of the funding it had to be made public. The ethical aspects of this study were approved by the Harvard Institutional Review Board, and Facebook was also aware of this study and approved of it. There were measures taken to preserve the privacy of the users involved: their names were removed, and contact information was removed. And for many of the attributes, say your favorite band, if one user had a few rare attributes it might be easy to re-identify them. So rather than telling you what the favorite books or bands were, they encoded them with numbers. You would know if two people had a book in common, but you wouldn't know what that book was.

Nevertheless, the anonymity of the data set was quickly and easily cracked, in the sense that people discovered where the data was coming from: it was from the Harvard class of 2009. The researchers who were collecting this data were also from Harvard. After this data set was cracked, there were some serious criticisms of the whole study. The first is that the subjects weren't ever informed that their profiles were being scraped, and so there was no way to opt out of this study. The second is that the profiles were being scraped from the privileged positions of fellow students.
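The encoding step mentioned above, where favorite books or bands are replaced by numbers so that shared tastes stay visible but titles do not, might look roughly like the following sketch. It is an illustration only and not the researchers' actual procedure; the function name `encode_attributes` and the sample profiles are invented for this example.

```python
import random

def encode_attributes(profiles, seed=None):
    """Replace each distinct attribute value with an arbitrary integer code.

    profiles: dict mapping a user key to a set of attribute strings.
    Returns (encoded_profiles, codebook); the codebook would be kept secret.
    """
    values = sorted({v for attrs in profiles.values() for v in attrs})
    codes = list(range(1, len(values) + 1))
    # Shuffle so the codes do not leak the alphabetical order of the values.
    random.Random(seed).shuffle(codes)
    codebook = dict(zip(values, codes))
    encoded = {user: {codebook[v] for v in attrs} for user, attrs in profiles.items()}
    return encoded, codebook

# Hypothetical sample data, not taken from any real data set.
profiles = {
    "user_a": {"Book X", "Band Y"},
    "user_b": {"Book X", "Film Z"},
    "user_c": {"Band Y"},
}
encoded, codebook = encode_attributes(profiles, seed=0)

# Two users still visibly share one item, but only as an opaque number.
shared = encoded["user_a"] & encoded["user_b"]
```

The codes preserve the overlap between users, which is the property the talk describes; what the shared item actually is stays hidden unless the codebook leaks.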
So the researchers in charge of this study hired students at the university to collect this data. At the time Facebook had sort of a three-tier privacy policy: you could share your profile and your friends with only your friends, or you could share that with the whole network, which in this case was a college, or globally. Most people chose to share things with the whole network. So the people collecting this data had privileged access to data that the users themselves might have thought was pretty private, and then this data set was going to be released to the public. Those are some of the serious privacy criticisms that arose, and the data set was pretty quickly yanked down from their website. Here's a note from 2010: it's still offline "as we take further steps to make this data set anonymous," and there have still been no updates since. I haven't heard of it being on BitTorrent. It was apparently released for a little while, but I haven't seen it anywhere.

So now I'm going to talk about the Facebook 100, which I call the evil twin of this Tastes, Ties, and Time data set. It was actually a hundred different college friendship networks. Here's a visualization of the Caltech network, which is one of the smaller ones. This data set appeared in early 2011, although the data itself originates from September 2005.
There's also a smaller data set released a few years ago called the Facebook 5, which was gathered a little bit earlier. This data came directly from Facebook: it came from the CTO of Facebook, Adam D'Angelo. That's what the paper that introduced this data says. So for 100 universities, we have the complete friendship network, and we have data on gender, dorm, academic major, and the high school that the user attended. One interesting point is that the friendship network is complete regardless of the privacy settings. So if a user said they only want their friends to know about their friendship network and their profile, it didn't matter; Facebook just included everything, basically. They did take some measures to try to keep this data anonymous: they removed the names, and again they encoded the attributes. So you wouldn't know if someone was male or female; there was gender one and two. And all those attributes had the same encoding.

Okay, now there's an interesting question: could this data set be cracked without much additional information? And yes, it could. That reference should say: there's a paper from Kleinberg and, I think it was, Backstrom. They were able to show that even if you have a huge network, say of everyone on Facebook now, or even if it included everyone in the world, if before the data had been collected you had implanted, say, between five and ten nodes in that network and created some random connections between them, you could efficiently find those nodes in the anonymized data set, and from there you could de-anonymize more of the network. And even if you didn't actively create accounts for this kind of privacy attack, if you just had information about your network before it was collected, it's very likely that you'd be able to find yourself again. So theoretically,
it's very difficult to anonymize social network data. So I tried to identify myself in this. I happened to be a college student at the time when the data was collected, and in half an hour I still wasn't able to find myself. I narrowed myself down to about 15 people. Then I just went to Google, and it's not really necessary to crack this data. One of the first hits on Google said: well, actually, there seems to be a mistake with this data. The original Facebook IDs: there was one more node attribute called ID. This probably wasn't supposed to be in the original data set, because after this data set was released, there was a slightly modified version released, and that one was missing this ID attribute. And then Facebook asked them to take the whole thing down altogether. So the original Facebook IDs are still contained in the data set. If you're interested in downloading this data set, this one is on BitTorrent. I've also created a parser for it if you want to put it into any of a number of network formats.

So here's how easy it was; here is how I found myself. I just went to Facebook and searched the source code for "user," and I found, oh, there's some number, it looks like a user ID. I went into the data set and sure enough, there I was. So that's not too interesting. Could we find someone famous? Here's Mark Zuckerberg. If you type in "Mark Zuckerberg Facebook ID," you'll find his ID is actually the lowest of all Facebook IDs. It's number four. So you can look in the Harvard network.
It was one of the hundred networks included in this data set, and sure enough, there he is. He has 156 friends, which doesn't make him particularly popular at Harvard; it makes him the 4055th most popular person there. So it's easy to find people in this data set. You could probably write a scraper that would identify them: there were 1.2 million people's accounts in this Facebook 100 data set, and it probably wouldn't be hard to identify a large number of those.

Okay, so that puts this Tastes, Ties, and Time data set into context. That data was pulled to protect the privacy of users. In the meantime, though, Facebook released basically the data that's contained in that set anyway: the Harvard network in 2005 is contained in this Facebook 100 data set. So the users' privacy has already been quietly compromised, and most of the users aren't aware of this. This brings up the question: why don't we distribute the Tastes, Ties, and Time data set or similar data sets? This is just one example, to be concrete, but there are other situations where data sets have been posted and then pulled down at the request of the provider.

Before delving into this question any more, we should consider two conceptions of privacy. The first one is harm-based. This one says: as long as no harm comes from the fact that I have your data, as long as I don't steal your identity or try to do anything with it that you would disapprove of if you really knew what I was doing, then we haven't done anything ethically wrong. And this is one important point: academics in general don't need to know the identities. They don't want to know the identities. They don't want to de-anonymize the network. They're just looking for patterns in the data.
So according to this harm-based conception of privacy, we can use this data as academics, and this, I would argue, is the conception of privacy that most academics work with. The other conception would be a dignity-based one. Both of these conceptions are from a paper by Zimmer in 2010. There's a bibliography at the end if you're interested in where these come from. The dignity-based conception of privacy says that if the data is stripped out of its intended sphere, then you're already compromising the basic human dignity of the user. So if you have a data set with millions of people, you can bet that probably a few of them don't want their data researched by anyone, no matter how benign it seems to the academics. I would say most academics who are looking at large data sets do not have this conception of privacy; otherwise, they wouldn't be looking at the data sets. So most effective research environments have adopted this harm-based paradigm.

So now another look at the current policy. You can exploit sensitive data for your own academic research. For example, this Tastes, Ties, and Time data set is still being used by the Harvard researchers who gathered it. You just can't share it. And the ostensible explanation for this is that academic use is allowed because, as academics, we're not interested in harming you. We aren't going to steal your identity. We don't even care about your identity.
So the reason we don't share it is so that it won't be used maliciously by people who do want to do those things. And the assumption here that holds this all together is that malicious users can't collect this data themselves. So as long as we don't distribute it, your privacy will be maintained.

Okay, so it's not only in this Facebook data set where privacy has been compromised. Of course, there are a lot of social media data sets that have been shared, and other data sets, like the Netflix Prize data set, that have been cracked in a way so that users could be re-identified. So this is a persistent problem in this field, and it basically boils down to the fact that it's hard to maintain privacy and accessibility for users simultaneously.

Another example: in Germany, before Facebook became popular here or was translated into German, there was a network called StudiVZ, and that was notoriously insecure. In 2007, I think it was, an undergraduate student at the Technical University in Munich scraped the entire StudiVZ friendship network, and others have scraped that one as well. So there's an illusion here: people believe their data is more private than it is. To scrape the StudiVZ data you would need to log in to StudiVZ, which means you've agreed to the terms and conditions, which say you shouldn't scrape it. But there are other cases. For example, blogger Pete Warden scraped 210 million profiles from data that Facebook had basically made public, and he didn't log in to Facebook at all to do this.
In fact, he even respected their robots.txt file, so he only scraped files that Facebook had seemingly shown were okay to be scraped. But although this data was completely public, Facebook threatened to sue Pete Warden and he had to take it down. The same could be said of other large Twitter data sets, or Foursquare data sets, or data from other social media sites.

Okay, and of course the leaks of data that we really have to worry about go unheard of. We don't know about malicious users because they don't try to publish and share their data. And of course the government also has access to a lot of data, which is something users maybe don't always think about.

So why does this policy exist? Well, one way to answer this question is to think of who benefits from this policy. It's not the users, who are just less aware of the vulnerabilities, basically because of the fact that the data sets can't be shared publicly. It's not science, which is held back for the reasons I've already stated. The service providers benefit, for example Facebook or Twitter: they avoid bad press, they avoid lawsuits. And malicious users also benefit: the vulnerabilities of these social media websites remain unknown, and users who aren't as aware of them are more confident about sharing sensitive information. So it seems one possible explanation for this current state, where researchers aren't sharing their data and aren't trying to build these data sets, is that we're afraid of the wrath of the service providers, such as Facebook. We can say we're not sharing it for privacy reasons, but if malicious users are likely to be able to access this data anyway, then why do we have to pretend like the site is private?

Okay, now I've simplified things a little. We should probably distinguish between three different cases. The first is where data can be collected publicly without any agreement with the source of the data, that is, the social media site that it comes from. And in this case I argue we should be sharing data.
It's public, and it should be known that it's public. If there's controversy that this data is being shared, it will only make the users more aware that the data they've put out for the public has been scraped and can be scraped. The next case would be where it's easy to collect a lot of data within a certain website, but you have to log in to do so; you have to agree to the terms and conditions of the provider. And the third one would be where data has been shared privately with researchers. This is the case with a lot of the mobile phone data research that has been done: the mobile phone providers share data with academics. In this case I would definitely say the academics should not be sharing the data sets, because the data is not available to malicious parties in the first place. Or not that we know of; it's not as easy for them to access.

Okay, so my point again is: if service providers leak this data out easily, then why should academia pretend like it's hard for us to access, so we can't share it? To return to the concrete example of this Tastes, Ties, and Time data set, it's sort of a gray area. It would be hard for malicious users to collect all the attributes that the researchers collected. It would be hard, but it would be possible for them. You would need an account in the Harvard network, which may not be that hard to get; you would need, I guess, to have a Harvard email address or to make Facebook think you had one. And they also did a lot of manual annotation on the data. So I will say it is sort of a gray case.
It's not easy to deal with this problem.

Okay, and now just one last slide on a host of other problems that come in if you try to enhance data sets. There's been some research done on filling in things people leave blank on their profiles. So if you have a social network and you have some set of users, say 20% of users, who fill in some attribute on their profile, say their gender or the year that they graduate or the year they were born, then there's been research that shows you can infer with high accuracy what the values are for the other 80% of the people. Another interesting example: some MIT researchers looking at the MIT Facebook network were able to show that you can come up with a machine learning classifier, in this case a logistic regression model, that can accurately predict whether a user is a gay man or not. So basically, even if there's not even a field for it on the site, or if you leave it blank, your friends reveal a lot about you if you're using a social media site. And the question is: how far should academia push this research? Should we try to enhance our data sets? If you have a data set and you're trying to do some research, it's really annoying when there are all these blank fields. You don't know what the gender of this person is, you don't know what the gender of that one is. Using such classifiers you can try to fill in all those blanks. Should we be doing that? And should we be coming up with methods that can do that very accurately? Okay, so I'll just leave off with that, and I'd be interested to hear any questions.

Thanks, Conrad Lee. And we said we'd do a Q&A session now. We also have a signal angel in the room who will be taking questions on the IRC channel. If there are questions in the room, which I hope there are, I'll take them, and we have another audio angel at the back for the back of the room.
So please raise your hand if you have a question. Anyone? Come on, one of you has to have a question on this topic. Over there, please.

So my question would be: you didn't really talk about the possibility of actually asking the users beforehand. So why is there not the idea of just asking the users if they would like to participate in this study or not? Because I would believe that you could find enough users who would voluntarily like to participate in this study.

Yeah, that would be possible. The problem is it's hard to ask people on a large scale. Generally, if you ask a very large, generic group of people and they don't know who you are, then they don't know how you're actually going to use their data, and they don't really trust you, probably with good reason. So it's hard to collect a data set on a large scale where you have the consent of a lot of users.

There was a question over here. Who was it?

Hello. I also work on that kind of subject and with that kind of data set, and one of the things I think is most important, and that you mentioned, is the relationships that researchers and big companies like Facebook develop. Because actually, nowadays there are not a lot of companies who have that kind of data set, only, well, Facebook and Twitter mostly. At the beginning they weren't very sure what kind of relationships they wanted with universities, and they were perhaps more open than now. But now, I feel, and I have heard at conferences, that they're not really interested in outside research on their data sets, and really interested in controlling what the researchers do with their work. So lately a lot of the research that has been going on with these data sets comes from people that work directly or indirectly with Facebook, with Twitter.
So for the general public it's really a problem, because we can't really know, from a security or intimacy protection point of view, what really is possible with these data sets: what marketers, for example, the companies that buy those kinds of data sets from Facebook or that can get access to that kind of information, can do with that data. There's really a lot of opacity, and that's the main problem, I think.

Okay. Yeah, I would say even if industry were more willing to work with academia in this area, there would still be a problem: they would just work with one university, and there would probably be interesting research done, but it wouldn't be replicable. You wouldn't be able to have other academics scrutinize the results. So this is why it'd be nice to have a truly open data set, or a few high-quality data sets for different fields.

Is there anyone else in the room? Yes, over here.

There was one point on your last slide that was about: if you know the attribute values of 20% of the people, you can infer the rest with 80% accuracy. Is that a statistical value, these 80%, or are 80% of these profiles that you calculate then somehow actually correct? Or is it that, in a network where I know they are almost only women, I know that statistically most of the others are women too? So is the individual data for one person correct 80% of the time, or is it statistical about the whole network?

Okay, um, I forget the details of the metric they used there, but it wasn't the case that, say, 80% of people were female in this data set, so that you could come up with a very simple classifier that just says everyone's female and you'd have 80% accuracy. That wasn't the case. I read the paper and was happy with that at the time. I forget the details on it, but you can look at the reference there. It's in the bibliography. I forget which metric they used.
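The distinction raised in this exchange, between a classifier that is genuinely informative and one that simply guesses the most common value for everyone, can be made concrete with a small sketch. The labels and numbers below are invented for illustration and are not taken from the paper mentioned in the talk; the helper names `accuracy` and `majority_baseline` are likewise made up.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the true label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

def majority_baseline(labels):
    """Predict the single most common label for every user."""
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return [most_common_label] * len(labels)

# Invented example: 8 of 10 users carry the label "f".
actual = ["f"] * 8 + ["m"] * 2

# The trivial rule "everyone is f" uses no network information at all.
baseline = majority_baseline(actual)
baseline_accuracy = accuracy(baseline, actual)
```

In this made-up population the trivial rule already reaches 80% accuracy, so a reported accuracy figure only says something about an inference method when it is compared against this kind of baseline.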
Anyone with a question? Yeah.

Isn't there a problem with certain groups being overrepresented in the data set? Because not everyone uses Facebook, and there are some groups who use it more and some groups who use it less. Isn't that a problem when using the data?

Yeah, that's true. That's why, for example, the people who created this Tastes, Ties, and Time data set carefully chose a group of people where nearly everybody used Facebook. I think in that incoming class there were 1600 people, and over 1,500 people had Facebook accounts. If you chose a group of retired people, you wouldn't have that same kind of participation. So yeah, if you're looking at social media in general, you have to make sure that the group you're trying to research is well represented, and be wary of biases that come from different rates of participation.

I think that was a... are you raising your hand?

I've got a question, just an ethical question. What would you do, or do you know anyone who already did this: when a hack of Facebook, for example, would be leaked, so if you've got a data set that's already public and could already be used by malicious guys, would you use the same data set for
research? And probably the answer's no, so why not?

I would say the answer is yes. This Facebook 100 data set, like I said, is on BitTorrent, and the Facebook 5 data set has been around for a while now, and people have published papers in which they use these data sets. Now, those were just recently pulled down, so I don't think any of those went through the review process after this flaw had been discovered, of the IDs being in there, and they're no longer officially distributed. But it seems that the journals themselves aren't denying papers; they're not carefully scrutinizing and saying how you've violated privacy in the collection of your data. It's a responsibility of the scientists. The journals themselves haven't taken on this sort of regulation role. So if other groups are researching the data, then your research group will be in competition with them, and then the groups who will be successful in this field won't have scruples about using those kinds of data sets.

Anyone else have a question? Please raise your hand. Yeah.

Yes, during the past one or two years there's been an ongoing debate about Facebook privacy and everything. Has this actually contributed to some changes? I mean, what is the current state? Would it be possible to create something like a Facebook 1000 with current data, or have the privacy changes actually affected anything regarding academic research, or was it just about adding stupid and useless JavaScript menus and buttons?

So are you asking if the debate about policy has changed Facebook's attitude towards releasing these data sets, or...

Yes, whether it would be possible for researchers to get a current snapshot or some sort of current dump of Facebook data. The Facebook 100 is a little bit older, but I mean whether the changes in the privacy conditions and everything of Facebook would still allow the creation of a new data set with current data.

Okay. I don't think so.
Well, of course it would be possible, but it only becomes more and more clear how hard it is to really anonymize this data, and so I think the tendency would be to be more careful with it rather than distributing it more, if you're in the position of Facebook. But I'm not sure what the people at Facebook actually think.

Anyone else? Don't be shy. Please raise your hand if you have a question. Anyone? No more questions? Do we have a question on IRC? No. Anyone? Now is your last chance. All right, then please give it up for Conrad Lee and his talk.