So thank you all very much for having me here today. It's a real pleasure to be here to talk with this group, and thank you for the opportunity to reconnect with Dr. Picanin, who was one of my very early mentors during my PhD. I hope I do her proud. I'm going to talk today about Twitter research ethics, and about thinking about ethics beyond the review board. My talk is broken into three parts, moving from the quite broad down to a specific example. First, I'll talk about ethics very generally: what we mean by ethics, and how those different meanings relate to the use of publicly available data. Second, I'll talk about research I've been involved in on how researchers go about their ethical practices when using Twitter data, about the gaps between what researchers do and what users actually think happens to their data, and about some of the problems therein. Third, I'll talk about a very specific example of value tensions around sharing Twitter data: tensions between what rules and regulations require of us and what our ethics may demand, and how we might practically resolve some of these issues.

I want to start today by thinking really big: what is ethics, and why do we care about this stuff? When I teach my undergraduate students about ethics in my information technology class, I go back to a very old idea that has been with us a very long time: ethics are systems of principles that guide us in making moral evaluations. We can rely on something like utilitarianism to guide us in making and evaluating decisions based on what will provide the maximum good. We can look to something like Kantian deontology, which suggests we should evaluate a given action based on what our duties are in that particular circumstance. And we might ask not just about evaluating acts, but also about character: virtue ethics, for example, asks us to reflect on the kinds of people we want to be. Are we instilling virtuous values in ourselves and in our actions? At their core, these ethical principles and systems are about using our capacities for reason, judgment, and thought to critically examine our own actions and our own character. I really want us to keep that idea in mind: ethics is, in part, about using our capacities for reason and judgment to think through our actions.

Now, as researchers, we often talk colloquially about ethics as the regulatory piece, the compliance side of things. Maybe you're having a hallway conversation with a colleague and you say, "Oh yeah, I have to do the ethics piece of my project now": going through the IRB, getting the informed consent paperwork approved. Certainly I go through this, and I'm sure all of you go through it with your work as well. This is a process of ensuring conformity with relevant laws, policies, and guidelines. And of course those laws, policies, and guidelines were developed in relation to particular ethical principles, the ones Dr. Buchanan mentioned at the very beginning: respecting human dignity and autonomy; maximizing benefit and minimizing harm, where you can see the trace of utilitarian thought; and justice and beneficence.

Now, ethics as compliance is really about ensuring that researchers don't violate certain baseline conditions for the treatment of others. These rules were set up as a floor, to make sure people don't violate the basic conditions for how we should behave toward one another. But that doesn't mean that's the end of our ethics. In fact, compliance and contemplation may not be playing in the same sandbox, as Dr. Buchanan reminded us. Sometimes compliance doesn't cover the ethical situations we encounter in our scientific research practices. And very often when using public data from social media sites, even though we have better policies today than we had a decade ago, there are still enormous gaps where rules and policies developed for the biomedical setting don't apply cleanly to the context of public social media data. Part of this is because, at least in the US (and I'm not an expert on EU policy, so I have to apologize there), the use of public data is often not considered to be research involving human subjects, and is therefore not subject to oversight by institutional review boards. But again, that is not to say these research projects lack ethical components; it is simply to say that they sometimes fall outside the purview of the compliance side of ethics.

I want to make the argument here that our ethical reflections, and our evaluation of our actions, need to be considered across the entire process of doing research. There is a litany of ethical quagmires we may encounter during data collection, data use, and data sharing: the entire gamut of the research process. In terms of data collection, we might think about data that was once public but has since been deleted; there may be ethical questions about how we should treat that data. We might be using data from marginalized groups: it may be public data, but we know particular populations have been overburdened by research, and so even when using public data we may still have obligations because of the historical injustices against those groups. Aggregating data points to create a very detailed picture of someone's life can be a threat to their privacy and has ethical dimensions. And there are questions we should ask, even with public data, about whether to honor a request from someone who contacts you and says, "I would like my data removed from your dataset." As part of data use, we might encounter questions about the actual ends of our project.
A really interesting example from just the past couple of weeks: researchers at the University of Minnesota had a project that fell outside the compliance side of oversight, because it was not considered research involving human subjects, in which they purposely introduced errors into the Linux kernel to study whether the development community would actually find them. We can recognize serious ethical issues with purposely introducing errors into a system that thousands upon thousands of people rely on, even if the IRB says it is not research involving human subjects. I also like to use the example of projects that scoop up large public image datasets and use pictures of people's faces to try to predict political leanings or sexuality. Some people have labeled this sort of work "phrenology 2.0." It obviously has severe ethical implications, even though it uses public data.

An important issue that has come up quite recently, and that I particularly want to encourage this group to think about, concerns studying misinformation and disinformation. It has come up in the context of research around Gamergate: think not just about your obligations toward your research subjects, but also about your obligations to your fellow researchers and, for example, your students. If you are studying vitriolic content, content that could be psychologically harmful, perhaps violent or disparaging, find ways to make sure that students and fellow researchers have the kinds of support mechanisms they need to engage with this work, so that they have support if they feel harmed by the content being analyzed.

Finally, in terms of data sharing, there are serious ethical questions around how we represent our data subjects. Consider the power of labeling: if we create a dataset, label it "tweets from people we think have depression," and make it publicly available, the way we represent those individuals can have real implications for them. We have to be reflective about how we represent our subjects. In terms of data sharing, we also want to think about other ways of sharing our research outputs with the communities we are actually studying. This can benefit researchers by building connections, but it is also a question of justice. And finally, when we think about upholding the values of science, replicability is a big ethical issue: we want to make sure the work we do is valid and reproducible, so that it benefits science and human knowledge going forward.

The thing I really want to emphasize in this first part is that there is no singular "ethics portion" of a research project. It is not just something you do up front; it is a process of continuous evaluation of our actions and our character, in which we weigh our values and duties.
I'll point very briefly to a framework that I think is useful in guiding that continuous process of evaluation, when values or duties conflict, or when there are tensions between what the compliance side asks of us and what the contemplative side tells us: disclosive ethics. This is a framework put forward by Philip Brey that asks us to continually engage in a descriptive process, identifying the value tensions or gaps we notice, and a normative component, working out how to actually address those gaps once we have described them in full.

I want to pivot now to getting a little more specific. That was my broad introduction to ethics and questions around the use of public data; I want to talk now about Twitter specifically. Twitter has obviously become a major data source for academics: over 2,000 research papers using Twitter data have been published in the past three years, and projects now sometimes include billions of tweets. We rely on Twitter not just because it has become a dominant space for political discourse and for responding to events, but because the data is relatively easy to get, and because it is the kind of data we can digest easily: textual data that is easily parsable for machine learning applications, and much easier to process and make inferences from than images or video. And because, comparatively, it is relatively "public." I have public in scare quotes there, and I'll unpack that in just a minute.

I want to note that in a research project I did with Michael Zimmer about five or six years ago, we looked at published research using Twitter data and found that very few of these projects reported going through ethics review. That is not to say they didn't, but only about 4% of the published research we were able to find actually talked about going through IRB or described its ethics processes. And I do want to emphasize that I think it is quite important that we talk about our ethics, that we include that descriptive component, and our evaluation of it, as part of our publications.

I want to talk for a moment about a research project I was involved in with Dr. Casey Fiesler at CU Boulder. We recognized that Twitter data was becoming more prominent in academic use, and we asked: what does the other side think? Do users actually know their content is being used for academic study, and how do they feel about it? So we surveyed Twitter users, asking whether they think researchers are allowed to use their content without getting consent. We then asked about their level of comfort with the idea of their tweets being used for research, and in particular we tried to understand the contextual factors that might drive users' comfort with their content being used in different ways,
such as whether they were asked for permission by the researcher, the kind of study it was, the kinds of content being analyzed, and whether they were quoted directly or indirectly in the study.

The top-line statistics I found really striking: over 60% of our respondents thought researchers were forbidden by Twitter's terms of service from using public tweets without asking users for permission. That is not the case; it is an incorrect understanding. After this question, we told respondents that researchers are in fact allowed to do this and asked whether they would like researchers to get their permission anyway. Roughly 65% thought researchers should not be allowed to use tweets without going back and asking the user. However, and this is the thing I do want to emphasize, many users are actually somewhat comfortable with their content being used if they are asked, or if they can see the research outputs; but this is extremely contextually driven.

I'm not going to sit here and read this entire chart to you, but let me give you a quick read of what's going on. The different contextual factors we asked about are listed on the left-hand side. Each row is essentially a Likert scale running from "very uncomfortable" on the left to "very comfortable" on the right, with darker blue indicating a higher proportion of respondents. For example, users are very uncomfortable with the idea of their content being used without ever being told about it. Over 50% indicated they would be very uncomfortable with someone using their tweets from a protected account, a privacy setting you can invoke on Twitter that controls the distribution of your tweets. People were very uncomfortable with a researcher using a tweet that was public when created but later deleted.

Where do we see increasing levels of comfort? Interestingly, with the idea of tweets being used in a big dataset: people were much more comfortable being one person in a study analyzing a billion tweets than being one of only a few dozen tweets or a few dozen people. Interestingly, users were also much more comfortable with an algorithm analyzing their tweets than with a human doing so. Folks were very uncomfortable with researchers using not just their tweets but other information in tandem with them, such as public profile information like location and username, and they were pretty uncomfortable with being quoted in a research study with their Twitter handle attributed to the quote. They were somewhat more comfortable with their tweets being quoted anonymously. So there are some interesting contextual findings here, but the big things I want to draw our attention to are the potential tensions in some of these results.
Users' understandings of how researchers actually use their data are, in my view, pretty limited based on this data, but users do have contextually driven levels of comfort, which are important to acknowledge and understand. At the same time that we recognize these serious gaps in users' understanding, it is also important for us as researchers to promote the progress of science, and values that serve the wider public, such as addressing misinformation and abuse at a much more macro level. One idea people have raised, for example, is notifying users when we use their data for research; but in some situations, that notification process itself could cause users more anxiety about being watched or studied than no notification at all. So there are different tensions at play, and it is really important to describe this space in order to understand them.

So again, I am moving from the really broad discussion of ethics and public data, through Twitter, to a very specific context: data sharing, and the tensions around data sharing on Twitter. In terms of the progress of science, a very important part of upholding scientific truth, and of ensuring public trust, is ensuring the validity of our work. A big way we do that is by providing our data and our methods so as to create opportunities for replication studies. This is really important: there has been a lot of concern about a crisis of replicability in scientific research, particularly in the social sciences, over the past decade or two.

In the context of Twitter data, one way we can increase the benefit to the broader scientific community is by making our datasets available for use by others. But there are obvious tensions here, starting on the compliance side. Twitter's terms of service forbid us from sharing the full JSON data of a tweet: we cannot reshare the content of a tweet, and certain metadata fields we are not allowed to share. The terms of service only allow researchers to share the tweet IDs themselves with third parties. Further, our local laws may or may not allow resharing of data. And at the same time, our funders may ask us to share: the National Science Foundation, for example, might ask me to share my data as part of ensuring open science practices. So we already have tensions just within the compliance side of things.

From the side of ethical contemplation, we realize that users themselves may not want to be in a labeled dataset. Again, think about a labeled dataset of tweets from people who, based on the sentiment of their tweets, we think may be prone to depression. Labeling someone this way can have implications for their privacy if I can look up a tweet ID and see the user it came from. We might also ask whether the risk is being evenly distributed across the population we are studying.
Think back, again, to situations in which marginalized groups have historically been overburdened by, or had injustices committed against them through, scientific research. And from the contemplative side, we also still want to uphold open science and the validity of our work: what happens if data starts disappearing from a shared dataset? Ed Summers, a researcher at the University of Maryland, did a project not too long ago in which he looked at how much content from a dataset had been deleted a year after an event, and found that somewhere between 10 and 15% of the dataset was simply gone: users had deleted that content after the event occurred. That really threatens our ability to recreate an event, to understand a particular social phenomenon. So we have a lot of different tensions here, and it is really important to describe them and think about ways we might resolve them.

I want to draw your attention to one project, the Documenting the Now project, which has developed a tool called the Hydrator that lets researchers easily "rehydrate" a dataset, grabbing all of that extra JSON data from Twitter's APIs based simply on a list of tweet IDs (I'll show a small code sketch of this workflow in a moment). This is a piece of code that helps address the fact that Twitter's terms of service require us to share only tweet IDs, which is in tension with the practical difficulty of actually reconstructing a dataset from them.

In terms of data sharing, we certainly want to find ways to anonymize user data when possible, particularly when the content is potentially sensitive or when we are applying a sensitive kind of label to a dataset. We might also address these tensions by making our datasets available only by request, or by setting up specific resharing agreements. For example, there is a series of datasets called the eRisk datasets, used for developing better machine learning models for things like predicting self-harm and predicting depression. These are tweets from individuals with a baseline ground truth: they have indicated, for example, a positive diagnosis of a particular mental state or a positive case of self-harm. You can get access to this dataset, but only by request, under a specific user agreement the researchers have set up, with terms that limit what you can then do with the data. The dataset itself has already been scrubbed of as many identifiers as the researchers think can realistically be removed, and the terms of reuse forbid trying to re-identify the individuals from whom the content originated and forbid resharing the data. So setting up your own terms for data sharing is one way of trying to address these tensions.

And then finally, the thing I want to come back to: document your decision-making. Discuss your decision-making.
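As promised, here is a minimal sketch of that rehydration workflow using twarc, the Documenting the Now project's Python library (the Hydrator desktop app wraps the same basic workflow). This is an illustration under assumptions rather than the project's canonical code: the credential placeholders, the ids.txt file name, and the field-scrubbing choices are mine, and I'm assuming twarc v1's hydrate interface.

```python
# Minimal sketch: rebuild a shareable tweet-ID list into full tweets,
# then scrub user-identifying fields before analysis.
# Assumes twarc v1 (pip install twarc); credentials and file names
# are placeholders you would supply yourself.
from twarc import Twarc

t = Twarc(consumer_key="...", consumer_secret="...",
          access_token="...", access_token_secret="...")

# The "dehydrated" dataset: one tweet ID per line, the only form
# Twitter's terms of service allow researchers to reshare.
with open("ids.txt") as f:
    tweet_ids = [line.strip() for line in f if line.strip()]

for tweet in t.hydrate(tweet_ids):
    # Tweets deleted or made protected since collection are skipped by
    # the API, which respects users' later choices (at the cost of the
    # 10-15% attrition Summers observed).
    scrubbed = {
        "id": tweet["id_str"],
        "created_at": tweet["created_at"],
        "text": tweet.get("full_text", tweet.get("text", "")),
        # Deliberately dropped: screen name, user ID, location, coordinates.
    }
    print(scrubbed)
```

Going the other direction, "dehydrating" a collected dataset for sharing is simply writing out each tweet's ID, one per line, and sharing that list instead of the full JSON.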
One of the big difficulties of developing ethical standards in this space is that we very often do not report our ethical thinking and decision-making as part of our publication practices. Getting to a point where we talk about our norms, and normalize them, is critically important. So again: our ethical evaluation needs to be a continuous process across the entire research process, from A to Z, and we need to document it and discuss it. I will leave you with this note: the Association of Internet Researchers has been thinking about the ethical questions involved in internet research and the use of public data for decades, and they have published two ethical decision-making guides that are available online. What I really like about these guides is that they are not a compliance-side model where, if you just do these things, you will be okay. They are a series of prompts set up to help us think through the particular challenges we might face in our own practices, to help us think out potential situations. And with that, I'd like to say thank you.