Okay, I guess we're gonna get started here. Thank you for joining us today on this early December afternoon. My name is Thomas Padilla. I'm the Deputy Director of Archiving and Data Services at the Internet Archive, and I'm joined here today by my colleagues, Rachael Samberg and Tim Vollmer. We're gonna talk today about legal literacies for text and data mining, otherwise referred to as LLTDM-X. So if you've ever seen that acronym out in the wild, you have come to this session and you now know what it means. So congratulations. Our project was funded by the National Endowment for the Humanities. We received support in 2022 to focus on assessing and providing resources for digital humanities scholars who are seeking to do text and data mining with foreign-held or licensed content, and/or who are part of international or internationally distributed research teams. This is basically the roadmap for our time with you today. We'll start by talking a bit about the origin of the project: why study these issues at all, and what are libraries doing engaging in this space? We'll move on to describe our approach to addressing these issues, which was quite experiential in nature. It involved written statements, virtual roundtables, and analysis with digital humanities scholars, and also a set of internationally distributed copyright experts weighing in on the issues as well, so a mix of centralization and decentralization in terms of strategy. We'll then proceed to talk about some of the tools that we produced in the course of this project, as well as lessons learned that are expressed in a white paper and a case study meant to guide researchers and, of course, their library colleagues, because as we know, in libraries we are often in the position of trying to help researchers navigate a pretty complex space. We'll then close by talking fairly briefly about the future, or futures, of this effort: what we see as being the next steps of value.
So with that, I'm gonna pass things over to my colleague, Tim Vollmer.

Great, thank you, Thomas. So it's helpful to start by understanding text and data mining, and of course we'll use the acronym TDM to describe this. Text data mining is an algorithmic approach to research that allows scholars to use software to analyze, classify, and extract information from things like texts, images, and data. So let me illustrate. Suppose we have a book like Pride and Prejudice. Now, of course, we can read Pride and Prejudice with our eyes, and we can appreciate the prose and understand the story, but we know there's a lot of latent information stored inside that book that we might not immediately be able to glean just from reading it with our eyes, and that's really where text data mining processes come in. For instance, text data mining can help us understand things like how many female versus male characters there are, what types of words female characters use as opposed to male characters, and even what types of behaviors female characters display relative to male ones. Now, using algorithmic extraction techniques allows scholars to study an individual work, but more importantly, to study huge collections of works at the same time. So basically, TDM can be used to study wide-scale social and literary patterns and relationships across a huge volume of data that would otherwise be impossible to sift through. Now, while these text data mining methodologies offer great potential for research, they also present scholars with some tough law and policy challenges. We've identified four main issue areas: copyright, contractual agreements, privacy, and finally, ethical considerations. So here's an example. Say a social science researcher is downloading and analyzing harassing speech in social media posts, and the researcher wants to be able to share their data sets to encourage the reproducibility of their study.
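To make the gender-pattern example above concrete, here is a minimal illustrative sketch of one such text-mining step: tallying female versus male pronouns in a passage. The word lists and the sample sentence are just illustration, not part of the project's tooling; a real study would run this kind of tally across an entire corpus.

```python
# A minimal sketch of one text-mining step: counting gendered pronouns
# in a passage. A real study would apply this across a whole corpus.
import re
from collections import Counter

FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}

def pronoun_counts(text):
    """Tokenize on word characters and tally female vs. male pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "female": sum(counts[w] for w in FEMALE),
        "male": sum(counts[w] for w in MALE),
    }

sample = ("She walked to the window while he read his letter; "
          "her sister asked him what he thought of it.")
print(pronoun_counts(sample))  # {'female': 2, 'male': 4}
```

The same pattern generalizes to the behavioral questions Tim mentions: instead of counting pronouns, one counts verbs or adjectives that co-occur with gendered character references.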
So this scholar would need to address matters of copyright: are these posts protected by copyright? Do they need to get permission from the copyright holder to use them? Or does an exception like fair use enable text data mining? Contracts: do social media websites impose terms of use that would limit what a researcher can do? And do things like website terms and agreements override copyright exceptions? Privacy is another issue: do these posts reveal information that infringes upon federal or state privacy rights of the persons described in those posts? And would republishing that data constitute a further privacy violation? And then finally, ethics: could downloading and recirculating the content create more harm to the subjects depicted in those posts? So together, these copyright, contract, privacy, and ethical issues that researchers have to grapple with are what we call the legal literacies for text data mining. We received a previous grant from the National Endowment for the Humanities in 2019 to study this, and that project was called Building Legal Literacies for Text Data Mining. It allowed us to host an institute to train a first set of scholars and academic staff on navigating these law, policy, ethics, and risk issues within humanities and social sciences text data mining projects. We recorded and released a collection of training videos, which are all available on YouTube, and we also wrote and published an openly licensed book to guide both researchers and librarians on all of these issues, including guidance about how to teach these lessons to other people. Now, of course, we all know that research is increasingly international in scope and practice, and in helping researchers understand these TDM issues in the US legal and ethical context, participants also raised cross-border issues that needed to be addressed.
These include situations in which the materials that researchers want to mine are housed in a foreign jurisdiction, or might otherwise be subject to foreign database or licensing laws; in which the human subjects they're studying, or who created the underlying content, reside in another country; or in which the colleagues with whom the researchers are collaborating reside abroad, which might create some uncertainty about which country's laws, agreements, and policies apply. So this really formed the basis of LLTDM-X, that's the cross-border project. Ongoing uncertainty among text data mining researchers about those three cross-border scenarios really became apparent. We conducted an informal survey in the run-up to our LLTDM-X project, and we found that 70% of respondents reported cross-border copyright questions, and 72% reported uncertainty about cross-border licensing terms. As you can imagine, this confusion impacts the potential for text and data mining research altogether. We heard that 28% of respondents said that these various cross-border issues impeded or prevented their project entirely, and 40% reported hesitation to share their workflows, their methodology, or their sources. Some scholars, we heard, slowed down their projects because they didn't know what problems they might lead to, and others tried not to ask too many questions because they were concerned that the law wouldn't allow them to proceed with their project. So we know that some of these cross-border problems are hindering text data mining research, but what are these problems exactly? When we started to speak with researchers about the specific kinds of issues that they faced, here are some examples of the questions they brought up. One is whether researchers can assemble and conduct mining on a corpus composed of materials published in or licensed from foreign countries, and whether those foreign countries' copyright rules apply.
Another is whether researchers can create and share a corpus with others located at institutions outside of the United States, particularly when these collections contain materials that are licensed to a specific institution. Another is whether text data mining researchers in the U.S. have to comply with privacy laws in other countries. And a final question was how to address privacy and ethical concerns when doing TDM on materials like diaries or personal letters, when the authors of those materials live abroad and didn't create those materials with the intention that they be used for text data mining research. So we designed our research study to answer these questions in a way that would be useful for our researcher participants as well as other people. First, we asked each TDM researcher in our project to write up a two-page description of their text data mining methodology, along with any other questions or challenges that they faced related to cross-border text data mining. We circulated these written statements to our project experts in advance of the first roundtable that we hosted, so that our experts could really familiarize themselves with the projects and prepare some probing questions to ask during our roundtable discussions. Then, during the first roundtable, we asked each of our researchers to share a three-minute story in which they discussed their text data mining project and also raised the cross-border issues. In two other roundtables, we asked our experts to help us identify and describe in detail the specific legal and ethical challenges that they observed, and also to reflect on the kind of guidance and education the researchers are going to need to manage those challenges. Following the roundtables, we charged each of our experts with providing written feedback to at least two of our researchers.
And this way, we really wanted to give the researchers some specific and tailored feedback right away, so they could understand how to address the specific issues that were relevant to their projects in real time. Then finally, our project team used the roundtable discussions and the analysis from our experts to identify several key takeaways, and these really helped inform the development of our white paper and our case study. And I'm gonna turn things over to our co-director, Rachael Samberg, to discuss these.

Hi, everyone. The white paper is great, and I really encourage all of you to read it closely and in detail. We can't talk about all of it today, but what I wanna do is just highlight a few of the key learning takeaways, because I think they can help inform the work that you do. So first of all, the LLTDM-X project has indeed confirmed that researchers' uncertainty about cross-border legal and ethical issues impedes them from taking on cross-border research questions and from partnering with scholars abroad. In the practitioners' written statements and roundtable discussions, the majority of them noted that they could not see any way forward with their digital humanities TDM research at all as a result of their legal concerns. We also saw that education can help. During the roundtables, the practitioners were actually surprised to learn that their perceived copyright hurdles were not insurmountable, because of the availability of the US fair use exception and the opportunity for researchers to disseminate analysis or derived data outputs rather than the underlying corpus. Conversely, practitioners expressed equal surprise to learn that often the more decisive hurdle in their cross-border DH text and data mining research would be negotiating contractual rights to share corpus content with other researchers.
Second, in addition to demonstrating the need for education on cross-border issues, LLTDM-X revealed the need for ongoing education regarding US-centric LLTDM literacies. Tim mentioned our previous Building LLTDM Institute, in which we demonstrated the efficacy of design thinking as a way to teach these legal literacies for text and data mining. That approach yielded increased researcher confidence, but that institute, which we offered in 2020, was for 32 individuals. And while we then released that open educational book to expand its reach and impact, the vast majority of DH TDM researchers continue to lack formal education about the legal and ethical nuances of text and data mining in the US. LLTDM-X surfaced all of that. For instance, multiple practitioners described fears and hesitancy about proceeding with mining copyright-protected materials that were published in foreign countries. They felt that foreign copyright laws would prohibit them from conducting text and data mining. Yet, particularly because the US participates in multilateral treaties like the Berne Convention, the law of the country in which the TDM is performed governs the infringement analysis, and the law of the country in which the works were published is not controlling. So US TDM researchers can rely entirely on US copyright law and the parameters of its fair use exception, and researchers were relieved to learn this. Ongoing US LLTDM training is therefore an essential counterpart to the cross-border instructional modules that need to be created or extended. Third, disparities in national laws may incentivize forum shopping by text and data mining researchers and exacerbate scholarly bias. National differences in copyright, contract, and privacy laws across jurisdictions have an outsized impact on, and can incentivize, researchers' selection of particular corpora to work with, particular regions to study, or particular collaborators to partner with.
I can use national variations in copyright laws to demonstrate this point. All countries have implemented copyright exceptions to creators' exclusive rights in order to support activities like scientific or scholarly research. Some of these exceptions may, by statute or through judicial interpretation, also authorize text and data mining research. But only approximately one-fifth of countries' research exceptions are broad enough to permit the full range of activities needed for text and data mining research, which requires the ability to copy, share, and analyze entire works in collaboration with others. As recently explained by Sean Flynn and others, some countries have research exceptions that permit uses of only excerpts of a work (Argentina, for example), or that do not apply to books or other kinds of works (that's most post-Soviet countries), or that have what's called a private research exception, which means that scholars can't share the works with others for analysis (Spain is an example of that), or that require membership in a specific research institute (Sweden is an example of that). The resulting impact of these variations means big problems for researchers, and we can think about the following example as indicative of that. Imagine you have a US text and data mining researcher who wants to partner with a scholar in Spain on a TDM research project. Some of the corpus would need to be created and downloaded in the US, some in Spain, and then they'd need to share it with each other in order to do the analysis. The research acts in Spain are first governed by the Digital Single Market Directive, which provides that text and data mining is fine as long as there is no subsequent public dissemination of the underlying corpus. But that's only the first step.
The Digital Single Market Directive is a legislative act that sets out a goal that European Union countries must achieve, but it's up to the individual countries to devise their own laws on how to reach those goals, and national laws have a margin of difference in this. When you actually look at what Spain's law is, it's one of the private research exception countries, which means that the Spanish scholar doing text and data mining there would not be able to share the corpus with other researchers, either in Spain or abroad. As a result of those kinds of variations (I just gave copyright as one example, but this also applies to contracts, privacy, and ethical considerations), you have potential resulting bias in the kinds of questions that researchers can study, the kinds of materials they can use, or the people they choose to partner with. WIPO, the World Intellectual Property Organization, is considering this fragmented landscape and, at least with respect to copyright law, addressing whether harmonization is desirable or even possible. But I can say with assurance that in the areas of privacy and contracts, which differ wildly across countries, there isn't even really hope for harmonization across national borders. The fourth lesson we learned was that license agreements often dominate the analysis of cross-border text and data mining permissibility. For most of the kinds of research people were describing, fair use under copyright law would have allowed the text and data mining itself, but the license agreements that institutions sign were really what controlled whether or not the scholars could use materials from other countries or share their corpus with their foreign colleagues. And that is because national laws on what's called contract override vary.
In the United States, in this free market economy of private ordering, we allow contract override, meaning that even if you have a right under copyright law, like the fair use exception, you can enter into an agreement that overrides that right. This is at play in all of the license agreements that we sign, in which we have to actually negotiate to get back the rights our scholars would have had under fair use. Not every country is like that. In fact, around 40 countries, including the European Union countries, prohibit contractual override. So if a group of scholars in Europe wanted to do text and data mining, they would be incentivized not to work with US colleagues, because the US colleagues would potentially not be able to share the research corpus back with their European partners. There really is no easy way to address this issue of contractual override. I think we'll be talking a little bit more about it tomorrow in the closing plenary, but this is a real key factor that influences the kinds of cross-border text and data mining that can happen. Another thing we'll be talking about tomorrow, and it turns out it's me who's gonna be talking about it, is the emerging lawsuits about generative artificial intelligence, which can impact our understanding of what's permitted in text and data mining and in cross-border text and data mining. Because we're gonna be talking about it tomorrow, I'll just say that among the questions that are going to get resolved in the lawsuits that have emerged around generative AI models are whether training the models with copyright-protected content is fair use, and then separately, whether outputs from generative models potentially infringe the training content. Now, in a lot of text and data mining, generative AI is not an issue.
You can train non-generative AI to make various assessments and determinations across a corpus of works, but in any case, discussions of generative and non-generative AI have really emerged most in the past six to eight months, and our institute was 10 months ago. So the researchers weren't really grappling with these issues, and whatever guidance we've developed may need to be tweaked a little bit depending on how these lawsuits shake out. Overall, we also learned that even if we can sort out all of these technical legal issues, it's very hard to quantify risk to researchers. There are questions for them about how proper it is for a US researcher to be brought into a foreign court, and about whether, even if a foreign judgment is issued against a US researcher, a US court would enforce it, which is a predicate to that foreign entity being able to recover damages. And there are non-quantifiable issues of risk, too. For example, what effect does it have on a researcher's reputation if they violate license agreements to do cross-border text and data mining research, that's discovered in their methodology, and then their paper is retracted or a publisher refuses to publish it? That has actually happened. There could also be risks to the people they're studying, because they might be exposing information that, as Tim mentioned, the content creators didn't originally intend to be used for this purpose. So how do the researcher's own ethical values and considerations feed into the quality or reputation of the work they're producing? And lastly, there is an opportunity for institutional review boards and various campus partners to work together better to support researchers. I'm gonna let Thomas talk about this in the futures bit, but I will just say that institutional review boards are set up to enforce what's called the Federal Common Rule.
The Federal Common Rule exclusively looks at whether or not you obtained consent for human subjects research. It does not address issues of copyright. It does not address issues of license agreements. It doesn't even address other kinds of privacy issues beyond consent, and IRBs also don't address ethics. So right now, IRBs are playing a very limited role in providing guidance to researchers on text and data mining, and then you factor in that we're talking about cross-border text and data mining. How is the IRB going to be able to guide the researcher on not only all of these things with respect to US law, but also on how they are treated across countries? We really have to look to campus partnerships for that. I do just wanna say that I also think there's an opportunity for institutions to enact an affirmative policy that basically says: look, researchers, we realize all of the onus is on you right now to figure out whether or not this is permitted, and if you make a good faith effort by using certain guidelines, we've got your back. This is what a lot of institutions already do with fair use policies. At the University of California, if we make a good faith effort to comply with fair use, the University will defend us. And that's where I think we need to go to support text and data mining research, with or without AI. Quickly, I will just point out our case study, which is also really good, to let you know that while all of these recommendations are great and help libraries understand how to move forward, researchers now need practical guidance, and that is what our case study does for the first time. It presents a hypothetical similar to the one I described for you about Spain; in fact, this might even look like Spain. Basically, researchers at US institutions are planning to do cross-border text and data mining, studying discussions of the 2020 presidential election in the private sphere and the public sphere in a country called Floria.
The private sphere spaces are social media posts in a Facebook group, and the public sphere spaces are journal or newspaper articles published abroad, and the researchers are also going to be collaborating with researchers in Floria. So we go issue by issue through each of the four literacies, copyright, contracts, privacy, and ethics, to address all of the researchers' questions in an orderly fashion that will help them navigate these questions as they arise in their own projects. For example, the first question that might come up for someone in the example I just gave is: if the copyright-protected materials, those Facebook posts and newspaper articles, originated in Floria, does the foreign country's copyright law apply to the infringement analysis? And the answer is no, US fair use will apply to the research acts that are undertaken in the US. So we go through systematically and answer all of that for the researchers. I am going to turn it over to Thomas to conclude with our future directions for where we see this work going and the need that lies ahead.

Great, thank you. This is gonna be really quick because I realize we're over time already. Sorry about that. So yes, in the course of this project and the project that preceded it, we engaged with humanists extensively, and we're still interested in engaging with humanists, but of course many of you are probably thinking that this is a multidisciplinary sort of challenge, right? So I think moving forward, we'll wanna be thinking about how we might be supported or resourced to engage social science communities, science communities, and more. That's kind of on the horizon. Rachael has already spoken, I think quite comprehensively actually, to aligning campus resources to strategically address some of the policy and practice issues, say around ethics, with the IRB being a potential partner in that particular work.
And then in closing: when we submitted the proposal for this project, I guess that probably would have been in 2021 if it was supported in 2022. I don't quite have the timeline exactly right, but AI just wasn't present in the way that it's present currently. So we're thinking pretty hard about what the next iteration of this might look like in terms of a more explicit focus on the implications of AI in research as it relates to the work that we've done up to this point. Thank you all so much. I didn't think we'd have time for Q and A, since we started four minutes late, so you guys can go get stuff to eat, but I think we do have four minutes of Q and A. Okay, so maybe we can take a couple questions if they're brimming out there in the audience.

These are wonderful resources, and I'm curious if you've had conversations or feedback from legal compliance and privacy offices on campus, recognizing that these resources seem to be directed towards the scholars.

That is a great question, and that is also embedded in the need to work with the IRB. It's not just the IRB, but also the Office of Legal Affairs or the equivalent on different campuses. Now, of course, they're categorically reluctant to take stands on things, and they also don't represent individual researchers. They represent the institution, which is why anytime there's an issue, the Office of Legal Affairs refers the researcher to our office to provide all of this guidance. So what really needs to happen is this kind of triangulation, which results, in my ideal world, in that kind of forward-thinking policy that I mentioned, which says: look, here are what we think are best practices, and if you do these, we will support you.

Katherine Klosek, Director of Information Policy and Federal Relations at ARL, and I just wanted to say thank you for this work. As you know, in 2001, the U.S.
Copyright Office suggested that there actually isn't evidence from the field about the contract override problem that you described. I think we now have a great deal of evidence, and I think it might be great to push for the Office to do some kind of study or roundtable examining this issue with the evidence that you all have gathered and the examples that others in the room have. So no question, just thanks.

Thank you. Anything else? Well, thank you all so much. I hope you learned something, and do check out the case study if you're a practical person, or the white paper if you're a theoretical person.