Welcome to my talk. My name is Jeroen van der Ham. I work for the National Cyber Security Centre in the Netherlands, and I'm also a guest researcher at TU Delft, the Technical University of Delft, where I think about ethics in computing and in cybersecurity research. I want to take you on a journey through how I got into this, the cases that I've run into, and then have a discussion about these cases and how they impact the world, and maybe also our work.

Four years ago, my wife let me go to OHM. I'm very grateful. And she let me go again, so I'm grateful again. At OHM, I gave a talk about the Pirate Bay Watch. A small recap: just a couple of months before, there had been a court case in the Netherlands where the copyright organization was suing ISPs to block access to The Pirate Bay website. While this court case was going on, everybody knew that the blockade of the website had zero effect. Everybody knew their way around it: by using a different DNS server, by using a VPN in a different country, by using the IP address directly, by using Google Translate, or some other way. So everybody knew that this was ineffective, but nobody had measured the effect.

And I thought: hmm, this is measurable. The Pirate Bay is about BitTorrent, and BitTorrent forms a swarm. You can monitor the swarm. You can monitor swarms before a blockade goes into effect, you can monitor them after, and then you can detect whether there's a shift in which ISPs are present in those swarms. So that's what I did. I wrote a small program that joined the swarm and logged IP addresses, and then I looked at the ratio between the different ISPs before and after certain ISPs implemented the blockade. There was almost no difference in the distribution, which meant that the blockade actually had no effect. I just spent a weekend putting a Python script together, scraping all of these records, and I logged two and a half million records in two weeks for ten movies, mostly Dutch-spoken movies, to make sure that we had a representative Dutch population.

But back then, somebody asked me the question: is it actually legal, what you're doing? And that got me thinking, because storing IP addresses is actually not legal in the Netherlands: IP addresses are considered personal information, and you can only store them when you have consent. And there was no way for me to ask consent from everybody who was using BitTorrent. So this got me thinking, and I approached an ethicist to talk about this case. We looked at it from all kinds of different angles, and in the end we concluded that because this was research, because it was done in the general interest, to provide clarity as a scientific result, and because informing people beforehand would probably have influenced the results, this was probably ethically okay.

So this got me interested in ethics, and I went down the path of figuring out whether there were more cases where computer science and ethics collided. Why should we look at ethics in computing? I made this argument at a workshop on internet measurement as well. There I said, basically: academics and computer scientists are given a very large space to do their research. They have freedom of teaching, freedom to carry out research, freedom to express opinions, and freedom from institutional censorship. At least, ideally they have.
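To make the swarm-measurement idea above concrete, here is a minimal Python sketch of the before/after comparison, under my own assumptions: it presumes you already have peer sightings logged as (timestamp, IP) pairs, and the `ip_to_isp` function is a stand-in for a real IP-to-ASN/ISP lookup. It is not the original script.

```python
# Sketch: given peer sightings logged from a BitTorrent swarm as
# (timestamp, ip) pairs, compare the share of each ISP in the swarm
# before vs. after a blockade date.

from collections import Counter
from datetime import datetime

def ip_to_isp(ip):
    """Placeholder: map an IP address to an ISP name.
    In practice this would be an offline IP-to-ASN/ISP database lookup."""
    prefix_to_isp = {"192.0.2.": "ISP-A", "198.51.100.": "ISP-B"}  # toy data
    for prefix, isp in prefix_to_isp.items():
        if ip.startswith(prefix):
            return isp
    return "other"

def isp_shares(sightings, start, end):
    """Fraction of swarm sightings per ISP within the window [start, end)."""
    counts = Counter(ip_to_isp(ip) for ts, ip in sightings if start <= ts < end)
    total = sum(counts.values()) or 1
    return {isp: n / total for isp, n in counts.items()}

if __name__ == "__main__":
    blockade = datetime(2012, 2, 1)  # toy date for illustration
    sightings = [
        (datetime(2012, 1, 20), "192.0.2.10"),
        (datetime(2012, 1, 21), "198.51.100.7"),
        (datetime(2012, 2, 10), "192.0.2.11"),
        (datetime(2012, 2, 12), "198.51.100.9"),
    ]
    print("before:", isp_shares(sightings, datetime(2012, 1, 1), blockade))
    print("after: ", isp_shares(sightings, blockade, datetime(2012, 3, 1)))
    # If the shares of the blocking ISPs barely change between the two
    # windows, the blockade had no measurable effect on swarm participation.
```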
But with great power comes great responsibility. What you see now is that ethics is becoming more a part of the discussion in computer science, and the general public is also discovering that this actually affects them: that the decisions made in algorithms really affect their daily lives, and that we need to have this discussion. But we also need to think about how we implement this in academic research, and in academic computer science research in particular.

This is typical of how at least internet measurement researchers think about research. They draw a Venn diagram: one circle for research that violates privacy laws, a big circle for useful internet research, and a huge overlap where they're breaking the law but doing interesting research. And this is just not okay. This approach is not okay, because we need to have a discussion about all of the useful internet research. A lot of the useful internet research would break the law if you did it naively, but there are probably ways to do it without breaking the law.

Formal ethics: in the US, every university has an institutional review board that looks at these kinds of ethics questions. In Europe, there are ethics committees, and each university should probably have one. But the ethics committees, at least in European universities, mostly look at plagiarism and professional conduct. They don't look at the kinds of experiments that computer scientists are doing. And the institutional review boards in the US have a very strict definition of what kind of reviews they can do: human subject research. If a study doesn't involve actual, living humans who are directly affected by the study, the IRB won't touch that research. I'll review a case where this comes up.

What you also see now is that ethics is increasingly tied to funding questions in European research. This morning we had a very nice talk by the SATORI project, which is built on the idea of responsible research and innovation, and that idea has become a cornerstone of European research funding. So for European research funding you now need to do responsible research and innovation, and you need to demonstrate that you're actually doing it. But it's mostly tick boxes: checkboxes saying "yes, I'm doing this", with no real check that it is actually done.

So formal ethics, I think, is a very good first step. But we also need to make the community morally aware. We need to discuss these questionable cases with each other without blaming people, discuss whether the research can be done without bad consequences, and help people think about other ways their research could be done without the ethically questionable methods.

One of the first cases that I want to discuss is the Facebook emotional contagion study. This was a study performed by Facebook with the idea that they wanted to see whether they could influence the emotions of Facebook users. They did this by filtering the news that people got: they looked at the timelines of English-speaking Facebook users, analyzed the emotion in posts, and filtered the feeds of several different groups.
They checked: what happens if you get only negative emotions, what happens if you get positive emotions, or some kind of mix. The end result was that there was a very small effect: if you get only negative emotions, you start writing posts with fewer words. But it was a really small effect.

The most important part, though, was how the study was performed: they didn't ask the Facebook users for their permission, and they didn't even inform them after the experiment was done and the results were in. The users were never told that they had been part of an experiment. The study was done internally at Facebook, but in cooperation with a university, and the university's institutional review board was involved only insofar as the researchers told the IRB: we're just providing the methodology, Facebook has the data, we don't access the data, so there are no human subjects in the part that we do. You could say this limits the scope of the institutional review board. It becomes ethical whitewashing, because the actual academic review of the ethical part of the study is never performed. So the process fails.

A different study that was performed was about measuring censorship. This was a study from the group of Nick Feamster and others. The idea was that they tricked other people's computers into trying to contact a web page, observed the behavior of those computers from the sidelines, and from that could see whether the connection succeeded or not. But what they actually did was trick other people's computers into accessing websites that were censored. You're making it look as if that computer is accessing a censored website, which could really affect the people who own that computer, because all of a sudden they may have to explain to their government or their police why they were accessing that censored website. The defense of the internet researchers was: we need to do this study because there is no other way to measure censorship, and we were very careful in selecting the sites that would be accessed.

In this case, the reviewers of the conference where this research was submitted said: this is not okay, what you're doing; we really want your IRB to sign off on this research. So the researchers went back to their university's IRB, and the IRB said: we don't see any human subjects, don't bother us, there's nothing we can do about this. I heard later that this paper was actually sent to numerous different conferences, and at some point the reviewers said: well, we need to have a discussion about whether this is acceptable or not, and we should publish this. In the meantime, one of the researchers had moved to a different university, so the reviewers asked: could you please also try the institutional review board of the other university? They did, and the other IRB gave the same result: we don't see any human subjects, therefore we can't judge this, so go away. Finally, the research was published with a big comment by the reviewers on top of the paper, saying: we don't think this is okay; we checked with two different IRBs and they said they can't judge this, but we need to have this conversation about whether we accept this kind of research or not.
Finally, a different way in which publishing data can go wrong: if you have a big pile of data and you want to anonymize it, you really need to be careful. This case was not academic research, but there was an attempt to anonymize the data, and it turned out that the seemingly innocent data about New York taxi rides was actually pretty harmful.

The story is that one person living in New York found that the city of New York was publishing statistics about taxi rides. He figured: you're basing this on data about all of the taxi rides in New York, so you probably have that data, and I would like to have it, please. And the city of New York said: sure, if you buy a USB stick, then we'll give it to you. He got two big CSV tables. One contained, for each ride, the pickup location and the drop-off location as very accurate GPS coordinates, plus of course the date and the time. The other table could be correlated with it and contained the fare that was calculated for the ride, and also the tip amount that was given by the client. The data could be tied back to the taxi drivers, so you could distinguish each of the taxis, but the taxi identifier was hashed with a SHA-1 hash. The identifier was the taxi's number: each taxi in New York has a number of five or six digits, possibly containing some letters, so the space of possible values is not that big. And you can probably guess it: yes, the hash was unsalted. So you have a SHA-1 hash applied to a very small search space, and you can simply reproduce it: you build a lookup table and reverse every hash.

What happens then is that you have the time and location of all taxi rides, and there are also a lot of pictures of celebrities getting in and out of cabs, often with the cab's number visible. If you have that, you can correlate it back to the rides they took. And from that, they concluded that Jessica Alba is a very bad tipper. The lesson is that if you want to publish this kind of data, you have to be really, really careful in anonymizing it, and you should really think about that before you do it.

One of the things we talked about this morning: if you are doing research, you have to look at the people who are affected by your research, and that goes a lot further than you would initially think. You yourself are affected, because if your research is attacked, you get the blame, so you have to be able to defend yourself, which may mean keeping more data than you would ethically want to keep. Your group at the university or company is a stakeholder in your research. The subjects are, of course, but if you collect data about enough subjects, or about a very specific small set of subjects, you may also be affecting people outside the collected data. Sometimes the research you do also affects your colleagues, even in your company, and if you do a really bad piece of research, it may even affect the whole field that you're in.

Like I mentioned before, the IRBs only handle federally funded, human subject research. And the ethics committees that we have in Europe are really ill-equipped: they don't have anybody who understands computing. We've seen how bad this is with politicians, but it goes even further with ethics committees, because they have no idea what is going on and they don't understand all of the problems.
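As a rough illustration of why the hashing didn't help, here is a small Python sketch. The medallion format below (one letter plus four digits) is made up for the example; the real point is only that a small identifier space plus an unsalted hash gives a trivially reversible pseudonym.

```python
# Sketch of why unsalted hashing of a small identifier space is reversible:
# enumerate every possible taxi number, hash it, and build a lookup table.

import hashlib
from itertools import product
from string import ascii_uppercase, digits

def sha1_hex(s: str) -> str:
    return hashlib.sha1(s.encode()).hexdigest()

def build_lookup():
    """Precompute SHA-1 of every possible identifier.
    Toy format: one letter + four digits -> only 26 * 10**4 candidates."""
    table = {}
    for combo in product(ascii_uppercase, digits, digits, digits, digits):
        medallion = "".join(combo)
        table[sha1_hex(medallion)] = medallion
    return table

if __name__ == "__main__":
    lookup = build_lookup()
    # A "pseudonymized" record as it might appear in the released data set:
    hashed_id = sha1_hex("K4025")
    print(lookup[hashed_id])  # -> K4025: the pseudonym is reversed instantly
    # A salted or keyed hash (e.g. hmac.new(secret_key, ...) with a key that
    # is never released) would make this precomputed table useless.
```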
And even if a computer scientist like me starts to wonder whether something is acceptable or not and goes to an ethics committee, and they say "we don't know", that doesn't really help me. So one of the things I've been doing is trying to start up this discussion in computing, at least in the Netherlands, about having more ethics committees in computer science. Like I mentioned before, in the censorship study there are no direct human subjects in the research, and even in the taxi cab data there are no direct human subjects. But through the research you're inferring things about the taxi drivers, and also about the passengers and other people involved. So we should really be thinking in terms of human-harming research instead of human subject research.

I'd like to close this off with a good example, because I implemented an ethical review in a master's education program at the University of Amsterdam. It's a case I also presented at CCC a couple of years back, but it's still a good case study. It's about Tinder. Tinder has had, and still has, a very bad reputation for how it treats user data. One of the first versions of Tinder, the one I presented at CCC, actually sent the GPS coordinates of your matches to your phone, so that the distance to those matches could be calculated on your phone. And it sent this over HTTP, so you could just intercept it. There was an open API, and you could just contact it and get all kinds of data. That meant you could basically track somebody who was using Tinder while they were using it: you could see where they were when they were doing it.

So somebody contacted Tinder saying: this is not really a good idea. They thought about it for a few minutes, changed the API, changed the app, and now the distance was pre-calculated on the server, and only the exact distance was sent to your phone. They still had an open API, which means you could just query from a couple of points; you need three points, and then you can locate somebody exactly. So it only took a little more effort to locate people. Then Tinder said, okay, we'll do better, and they implemented a threshold of 100 meters: everything below 100 meters was rounded up to 100 meters. My students wanted to know whether this was actually effective. I said: okay, you can do this, but if you want to do this, you have to get informed consent from all of the people you're testing this on. They said: okay, we can do this. They also created some test subjects of their own, and they went out and started doing the research. They found a way to access the API, they found a way to access all kinds of data, and a couple of days later they came back to me and said: well, you have a problem.
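As an aside, here is a minimal sketch of why three exact distance readings are enough to pin somebody down. It uses a flat-plane approximation in plain Python, and it is my own illustration of the idea, not anyone's actual tooling.

```python
# Trilateration sketch: if an API reports your exact distance to a target
# from any point you choose, three probe points pin down the target.

import math

def trilaterate(p1, d1, p2, d2, p3, d3):
    """Solve for (x, y) of the target given three probe points and distances."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Subtracting the circle equations pairwise yields two linear equations.
    a1, b1 = 2 * (x1 - x2), 2 * (y1 - y2)
    c1 = d2**2 - d1**2 + x1**2 - x2**2 + y1**2 - y2**2
    a2, b2 = 2 * (x1 - x3), 2 * (y1 - y3)
    c2 = d3**2 - d1**2 + x1**2 - x3**2 + y1**2 - y3**2
    det = a1 * b2 - a2 * b1
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

if __name__ == "__main__":
    target = (350.0, 120.0)  # the position we pretend not to know (metres)
    probes = [(0.0, 0.0), (500.0, 0.0), (0.0, 500.0)]
    dists = [math.dist(p, target) for p in probes]  # what the API would report
    print(trilaterate(probes[0], dists[0],
                      probes[1], dists[1],
                      probes[2], dists[2]))  # -> (350.0, 120.0)
```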
We were doing the research at a university, and a very big part of the Tinder population there is students, which meant they couldn't pick their own test subjects out of the big pile of data they were getting back from Tinder, and it also meant they were inevitably collecting data about real users who had not consented. So they couldn't effectively perform their research that way. They said: okay, we want to do this in an ethically acceptable way, so we have to think of a new way of doing this. Eventually we settled on hashing the user ID that they got back from the Tinder API and storing only the distance that they got from the server. From that they could infer whether you could still locate somebody, but you couldn't see who it was or what their exact location was.

With that methodology they found that Tinder had not actually implemented the threshold properly. In the API calls you can set the range that you want to search within, and if you search within 10 meters, you get results that claim to be 100 meters away from you, but the set of results you get back is a lot smaller than if you ask for everything within 100 meters. So they were only spoofing the displayed number; the real distance was still being used. With that, they found that Tinder had not implemented a good way of hiding the location, and you could still track Tinder users. We tried to report this to Tinder; they never responded, and I don't actually know whether they fixed it or not. But at least, in my view, we were able to research this in an ethically acceptable way, without directly affecting the privacy of Tinder users.

The takeaways I have for this talk: ethics matters, for internet research, for security research, and basically for computer science research in general. Think about the stakeholders, think outside the box, and think about the people who are indirectly involved in the research that you do, even if it's only a small project. If you're at a university: the IRBs and the ethics committees need help, and they need people to start thinking about this and to help them start thinking about it. And finally, don't be afraid to reach out to other people about the ethical aspects of security research or internet measurement research; it really helps to talk to others about the kind of research you're doing and the risks you're seeing, and maybe even missing. Unfortunately, there was a really good webpage, ethicalresearch.org, but it appears to have gone down and it's not available anymore. I'm trying to get it back up, or at least to figure out why it's down, because it was a good way for people to reach out and discuss these kinds of cases. If that page doesn't come back up, I'm always available to talk about research ideas and to help you think about the ethical aspects; you can reach me through that email address, or the other one, or in person here. And with that, I thank you very much for your attention.

Questions?

So, let's see. Imagine the taxi cab research was performed at your own university: what would you have recommended to the scientist in this case? Because the data set was not his, and presumably this is a data set that anyone can just obtain, which complicates the matter even more.

The data set: he actually had to physically go to the office of the city of New York to get the data.
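A rough sketch of the students' record-keeping approach as described above; the function names and record layout are my own illustration, not their code. The idea is to keep only a keyed hash of the user ID, the requested search radius, and the reported distance, so the analysis still works while nothing in the stored data identifies or locates a real user.

```python
# Ethically-constrained measurement sketch: store a keyed pseudonym plus the
# two numbers the analysis needs, and nothing else about the user.

import hashlib
import hmac
import os

SECRET_KEY = os.urandom(32)  # kept in memory only, discarded after the experiment

def pseudonymize(user_id: str) -> str:
    """Keyed hash: records can be compared with each other, but cannot be
    tied back to a real account once the key is thrown away."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()

def record_observation(user_id: str, requested_radius_m: int,
                       reported_distance_m: int) -> dict:
    """Keep only pseudonym, query radius, and the distance the API reported."""
    return {
        "user": pseudonymize(user_id),
        "requested_radius_m": requested_radius_m,
        "reported_distance_m": reported_distance_m,
    }

if __name__ == "__main__":
    # If a query with a 10 m radius still returns users reported as "100 m"
    # away, the app is only rounding the displayed number while the API still
    # filters on the real distance -- which is what the students found.
    print(record_observation("hypothetical-user-id",
                             requested_radius_m=10,
                             reported_distance_m=100))
```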
He didn't really think about it, he didn't know whether there was a problem in the data or not, and he just published it online and said: look, this is cool data, and you can do cool stuff with it. And it was: a lot of people were doing cool stuff with it. But later somebody looked at it and found the problem with the anonymization. I would say that if you're going to publish this kind of personal data, or even indirectly personal data, you really have to reach out to somebody who knows anonymization. Anonymization is a very difficult field. It's being researched a little; I'm trying to do that myself, for network data, and I'm finding that it's basically impossible to anonymize network data, which is a very sad conclusion, but I think it's still a valid one, or at least it's near impossible. For other kinds of data, it is actually possible to do the anonymization well. In this case, if they had just salted the SHA-1 hash, it would have been almost impossible to correlate it back, and I would have suggested that. But once the data is out there, there's nothing you can do about it anymore.

Have you looked at the way, for instance, the field of sociology handled this a few decades back? Because they had the same problem, that observing behavior could be ethically problematic. I think they now also have medical-ethical committees they have to go by to do all kinds of research. But the other problem is that when you don't have a specific research group of people in a room, but you observe people in the wild, just behavior in the wild, that can be problematic with asking consent from people and all kinds of things. Have you looked at that?

I have. There has also been a very big study by the Dutch Royal Academy of Sciences. They looked at the ethical aspects of computer science research, and one of the first things they did was reach out to the medical community and to the social sciences to see what they were doing. Actually, there's a very funny story there. There was a meeting where we were discussing ethical aspects of computer science research, and a professor of medical science was there, and he said: I don't get it. We have these medical files, and they used to be in a cabinet, and now they're on a computer. I don't see a problem. So medical ethics is actually a really small field, and it's very restricted: again, there it's human subject research, and mostly medical research. But in sociology it's different. In sociology you have the very big idea of informed consent: you have to sign a consent form if you want to participate in a study, and the consent form has to be written in such a way that people participating in the study actually know what they're signing up for. You can sometimes be deceitful about what kind of study you're performing, but then you really need to tell people afterwards what you were actually doing. So yes, in the social sciences they did do more than in the medical sciences. But the problem in computer science is still that we're doing a lot of research that indirectly affects people, and very often, like in the case I had with the Pirate Bay Watch, there is no way at all to actually reach out to the users, and that becomes a problem.
I thought in sociology they also had a method, for instance, for measuring people in the wild, let's say when you can't ask consent from everyone passing by at a certain point or something; they have some institutions to check that, as far as I'm aware.

Right, that would be interesting to look at. But what I can imagine is that if you do that, for one, you're doing it in a public space, and people can be monitored in a public space without their consent; and second, most of the results don't lead back to an individual, they're mostly about totals and averages. Then you also get into the discussion of what a public place is; I'm not going to answer that. But with the research that you do in social science, like I said, a lot of it can be done without leading back to individuals. For internet research, it's a lot harder to make sure that something is not traceable back to a person. Oh, back there.

Hello. This may be a naive question, but I was wondering: about the Facebook research, why do you consider it a bad thing that users are not aware that they are part of an experiment? I mean, from a research perspective, you can actually argue that it's a good thing. And yes, they are being manipulated, but people are manipulated all day, for instance by billboards, ads, et cetera.

It goes back to the slide that I had at the very beginning. If you're doing academic research, you get a lot of freedom, but you also get a lot of responsibility with that, I think. With the Facebook study there's actually an underlying discussion about whether we think it's acceptable that Facebook is doing this kind of research, as opposed to university researchers doing it. Do we have a different measure for what Facebook is doing versus what academics are doing? And there's also the fact that much of what Facebook did in that study, they do all day anyway; they just don't publish about it. That's basically the difference: in this case there was actually a publication, and because there was a publication, there was a very big discussion about whether this was acceptable or not.

So I guess for academics there's a higher standard. Yeah. But do you see it as a problem that users are not aware that they are part of an experiment? I mean, what's the problem with that? Because if they are aware, they might be biased.

That's true. But then, like I said, for academics there's a higher standard. If you want to do research where it's problematic to inform people beforehand, then one of the things that has to happen is that you reach out to an ethics committee. You have to tell them what you're going to do, and they make sure that it doesn't actually harm anybody, that you have measures in place so that you can pull the plug if it does harm anybody, and that you have ways of supporting the people who are in the experiment. And finally, there's the expectation that you inform people afterwards. In the Facebook study, they didn't even do that.

I would compare it to, for instance, billboards, because they also basically manipulate you into wanting a certain product. You're not warned about it, and there's no committee involved, probably.

No. But that's accepted practice, and people know about it. I guess it's part of your upbringing that you're told about these kinds of things and what kind of effect they have on you.
Yeah. There was a question back there.

Hi. I can see why Facebook's research was problematic from an ethical point of view; I just cannot see how your research regarding the Pirate Bay might raise concerns. I do understand the legality: I understand that gathering or storing IP addresses is illegal in terms of privacy law. But I don't see it as an ethical problem. Could you comment on that? I mean, you collected those addresses, but you processed this information in order to produce a statistical result, which proved a point, right? There's nothing unethical in that, but maybe it's illegal the way those addresses were stored on your computer.

Well, there's actually a very big overlap between the Facebook study and the study that I did: I gathered the data without telling people, so I was deceiving people into giving me the data. I joined the swarm, and they were expecting me to help them in the BitTorrent process, and I didn't, or at least I didn't only do that: I also collected their data, which is something I didn't get their permission to do. In fact, at the time of the contagion study, Facebook's user license agreement didn't list scientific research as something you were agreeing to. It took them a couple of months, and then they did add it. So now they do have your permission to do academic research on all of the data that you provide to them.

Thanks. Any more questions? Then I thank you very much. Oh, there's one question back there.

I think you brought up some good questions, and I would say that when looking at research we should make sure we don't conflate legality and ethics, in either direction. But I just wanted to add: most of my friends are not technical people, and a lot of them found the Facebook study really offensive, because they felt like they were being deeply, deeply manipulated. And also, you said that the result for people who were shown only negative content was very minor, that it only led to a decrease in the number of words in a post. But I would argue that we don't actually know what the number of words in a post means. The number of words in a post may actually correlate with something like a significantly higher score on a depression index. So while in this sort of research we can look at the outcome and go, oh, well, that's minor, we may actually be seeing what looks like a small result that corresponds to a much, much larger actual human impact.

That's true, yes. I think they did try to look at that, and of course Facebook has a lot more data about people that they didn't publish or didn't take into account in the analysis. I don't know whether they used that to judge whether people got depressed or not.

And I guess there are also a lot of assumptions built into the emotional impact, like assuming that the only measurement of it is Facebook engagement. Sorry, I went to grad school for public health, so I think there's a lot really, really deeply wrong with the Facebook study, even if it seems pretty minor.

Yeah. So I guess the conclusion is that people are very sensitive when you start messing with their emotions. All right, thank you very much for your attention. I'm here if you want to discuss more, and enjoy SHA.