All right, welcome back, everybody. I'm delighted you all were able to stay on through the final session of the meeting. I think you'll find this to be quite an interesting panel, picking up on some themes that seem to be running through many conversations I've heard during the meeting. For those of you I didn't see at the opening session yesterday and haven't had a chance to meet, I'm Cliff Lynch, the director of CNI. I have a few closing housekeeping sorts of things, and then we will move right on to the closing plenary panel, which I will moderate, hopefully with a pretty light hand. We are going to allow some generous time as part of that closing panel for questions and comments from those of you in the audience here, because I think you may have some thoughts and questions about the matter at hand, as will our wonderful set of panelists. But before I go there: I hope you've had a good meeting. I hope you've heard some things that were new, gotten some new perspectives on issues, made some new connections with colleagues, become aware of some things you didn't know about before, and overall found this to be a stimulating and valuable day and a half with us. Over the course of the next few weeks we will be making all of the presentation material that we can get available on our website. We will also be making the videos available. Usually we do the two plenaries first and then roll out the additional breakout project briefing sessions over the following couple of weeks, so you may get some of those before the holidays; you'll definitely get them in the near future. We will be sending you a meeting evaluation questionnaire by email, so you can look forward to that cluttering up your email along with all the other things that show up about this time of year. I want to offer a couple of rounds of thanks before we move on to the closing plenary. I've managed to catch a reasonable number of the project briefings, not all of them; I'm looking forward to some of those recordings myself. They've really been wonderful, I think, and I think we owe a great deal of thanks to all of the presenters who shared their thinking and their expertise and their insights with us. I'd like to call for a round of applause for all of our presenters. And I'd also like to take a minute to thank the CNI team. They make these meetings go smoothly and sometimes even make it look easy. Trust me, it's not. They are just so good at what they do, and I really want to thank them for all of their efforts in making this another successful meeting. So thank you. And that's about all the housekeeping I have, so let me get settled here and we'll move on to the closing panel. Let me introduce the topic in a very brief way. There's a somewhat more elaborate statement in the program, and I'm not gonna read it. What I'm gonna do is observe that the advocates for open science, for open scholarship, have historically had a pretty clear vision. They have wanted the scholarly record to be rapidly and universally available to all of the people who wanted to read it and learn from it and build upon it. That was a very fundamental goal of the open access and public access movements that have been underway for so long; it's one of the ideas at the heart of open scholarship. And they've pursued those objectives in a very complicated landscape of economic constraints, and of legal constraints introduced by copyright, for example.
And they have structured a series of choices and tactics, and sort of ethical positions in many cases, that work around and in many cases use that system to advance these goals of the universal ability to read, to learn, to build upon. And even beyond the human use of that scholarly record, I think historically there has been a high level of comfort with the computational use of that scholarly record, in part because the dream of universal access has, to a great extent, been riding on the capabilities of technology: computer networks, the ability to share digital copies of things at very low cost, all of the sorts of things that move us away from the historic constraints of shipping paper around, for example. The idea, for instance, of text mining of the scholarly record, starting with very simple things like keyword searching and then moving on to much, much more sophisticated text mining approaches: people were comfortable with that. People were actually, I would say, even generally comfortable with computation on the scholarly record at the corpus level, doing things like citation analysis, tracing webs of citations to understand how the literature evolved, and various forms of content analysis. So I think historically there's been quite a good consensus around those things and those objectives. As we'll see, there are some boundary cases that are at the least troubling or generate some discomfort, but the broad consensus was kind of there. Now suddenly, in this last year or so, we find ourselves in a world where people are training the large language models that drive generative AI systems and that seem to have an insatiable appetite for material. And if it's high quality material, if it's vetted material, if it's accurate material, so much the better; that's really desirable content. People don't seem to be sure how they feel about this. It's a step farther than the kinds of uses around which we've historically had this consensus. The current situation, I would say, is further confused by a good deal of legal ambiguity. For example, there are some people who have taken the position, I believe, that training AI systems on copyrighted material should be viewed as a fair use. There are other people who rather vehemently disagree with that position, and there are various lawsuits starting to find their way through the legal system that may ultimately, probably over a period of some years, give us some clarity on that. But right now, we're trying to figure out what we want against a backdrop where the rules are somewhat ambiguous, which complicates, perhaps, tactics. These are the issues that I'm hoping our panelists can explore today and that we can explore with you today. We touched on a few aspects of this briefly in the Q&A after the opening panel, particularly around things like the repurposing of art collections as training material. But I think our focus here is gonna be very, very strongly on the scholarly record, which is a particular and specific kind of issue. And I'm hoping we can not just get bogged down in the legal issues, but really think a bit about what kind of a world we want. What are our goals? Does the consensus that I describe extend to enabling various kinds of AI and machine learning powered technologies? So that's most of what I have to say as framing for this panel over the next hour. I have a few questions I'll ask before we open it up.
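To make the corpus-level computation Cliff describes concrete, here is a minimal sketch of citation analysis over a toy corpus. The paper identifiers and citation pairs are invented purely for illustration; a real analysis would load them from bibliographic data.

```python
# A minimal sketch of corpus-level citation analysis as described above.
# The paper IDs and citation pairs are invented toy data; a real analysis
# would load them from a bibliographic source.
import networkx as nx

# Directed edge (a, b) means "paper a cites paper b".
citations = [
    ("paper_2021_a", "paper_2015_x"),
    ("paper_2021_a", "paper_2018_y"),
    ("paper_2022_b", "paper_2015_x"),
    ("paper_2023_c", "paper_2021_a"),
]

graph = nx.DiGraph(citations)

# In-degree = how often each paper is cited within this corpus.
cited_counts = sorted(graph.in_degree, key=lambda pair: pair[1], reverse=True)
for paper, count in cited_counts:
    print(f"{paper}: cited {count} times in corpus")

# PageRank follows the web of citations, weighting a citation from an
# influential paper more heavily than one from an obscure paper.
for paper, score in nx.pagerank(graph).items():
    print(f"{paper}: influence score {score:.3f}")
```

In-degree gives raw citation counts within the corpus, while PageRank traces the larger web of citations, which is the kind of corpus-scale computation the consensus has historically been comfortable with.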
What I'd like to do, though, is to start by just inviting each of our panelists to introduce themselves and talk a bit about where they're coming from on this, because there are many perspectives here: the perspectives of authors, of editors, of stewards of the cultural record, of librarians. Many of the panelists, I think, wear all of those hats at the same time, or at least more than one of them. So I wanna give them an opportunity to talk about their perspective on this a bit. And just to be perverse, we've decided to go in alphabetical order by first name. So I'm going to invite Heather to start by introducing herself and where she's coming from. Thank you, Cliff. Hi, everyone, I'm Heather. I'm Heather Sardis. I'm the Associate Director for Technology and Strategic Planning with the MIT Libraries, and I'm really excited to be here today. So thank you, Cliff, for having me. I'm excited to share this panel with Rachael and Richard. I think they're doing brilliant work, and I'm so excited to hear their perspectives on the topic at hand. A little bit about my background: I'm a librarian first and foremost, but I split my time between being a librarian and a technologist. I come from the nonprofit world primarily and made my way to MIT out of a desire to work on issues of open scholarship and really apply technology in service of library aims. So although I'm very honored to be on the panel, I'm also sitting here extremely inspired by all of you; just spending the last two days hearing about the incredible work that is happening in so many different corners of the work that we share has been really wonderful. So I just wanna give my gratitude back to everyone sitting here today. And I guess, to answer your question, Cliff, about the perspective that I'm coming from: I would call myself an optimist when it comes to AI. I might not go quite so far as to label myself a techno-optimist, but I'm definitely coming from a glass-half-full perspective. But I'm also a techno-realist. I lead a technology team, we build a technology platform, and we deal daily with the realities of maintaining large platforms of scholarly technology. So that's the perspective I'm hoping to bring to the conversation today. And I think that means I'll hand it over to you, Rachael. Great. Thank you also, as Heather said, for the opportunity to have this discussion. I'm Rachael Samberg. I lead the Scholarly Communication / eResource Licensing / Information Policy Office at the UC Berkeley Library. Among many other things, we train authors and scholars on what their rights are with respect to copyright and licensing and privacy and ethics. And I'm up here in part as a copyright lawyer, and I suppose a token legal representative, because these can be very challenging legal and ethical issues. And to help us avoid getting bogged down in the legal issues, as Cliff said, I'm going to take the liberty of giving you my thoughts for every single one of the six minutes Cliff generously allowed us for opening remarks. So: the law and policy landscape underpinning the use of AI models is complex, and regulatory decision making in the copyright sphere will have ramifications for global enterprise, innovation and trade. The pending lawsuits that Cliff mentioned, and a parallel inquiry from the Copyright Office, raise important and timely questions, many of which we are only beginning to understand.
But there are two precepts that I believe are clear and that bear upon the nonprofit education, research and scholarship being undertaken by scholars who rely on AI models. First, training artificial intelligence is a fair use, particularly so in nonprofit educational and research contexts, and maintaining its continued treatment as fair use is essential to protecting research, including the kinds of text and data mining research Cliff mentioned. Now, not all text and data mining research methodologies necessitate the use of AI models. The words 20th-century fiction authors used to describe happiness can be searched for with algorithms that look for synonyms of mirth or joy. But if you actually wanna find happy characters in literature, you need to train AI to recognize what a happy character looks like. Now, this is non-generative AI. It's not creating the next Pride and Prejudice with happy characters in it; it's detecting the occurrence of happy characters. And this is a very common practice within text and data mining and has been non-controversial for years. Previous court cases have repeatedly concluded that reproduction of copyrighted works to create a corpus and to do text and data mining is a fair use, and they further hold that making derived data or analysis from that research and from the corpora is also a fair use, provided that those methodologies don't distribute or re-express the underlying works to the public in a way that could supplant the market for the originals. So for the same reasons that doing those processes in text and data mining is a fair use of copyrighted works, so is the training of artificial intelligence to undertake that text and data mining: in large part because of the same transformativeness of the purpose, which is fair use factor one, and because the training does not actually reproduce or re-communicate the original materials to the public, so it also does not supplant the market for the original under fair use factor four. So there is no distinction to be made here from a copyright perspective based on the nature of the work being used in training, whether it's art, scholarship, news or anything else. But there is an important distinction to make between AI training inputs and AI outputs in the case of generative AI. The overall fair use of generative AI outputs cannot always be predicted in advance. The mechanics of generative AI models' operations suggest that there are limited instances in which generative AI outputs could indeed be substantially similar to, and potentially infringing of, the underlying works used for training. So while we know that the training of AI is fair, we don't necessarily know in some cases that the works created by generative AI, the outputs, would be a fair use. Now, I said I believe we know two precepts, and the second is that scholars' ability to access the underlying copyright-protected content to conduct this fair use AI training should be preserved, with no opt-outs, from the perspective of copyright regulation. The fair use provision of the Copyright Act does not afford copyright owners a right to opt out of allowing other people to use their works in any other circumstance, for good reason.
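To make the distinction between the two text and data mining methodologies just described concrete, here is a minimal sketch contrasting a synonym lookup with a small trained, non-generative classifier standing in for the AI model. The sentences, labels, and word list are invented toy data; a real study would train on a large annotated corpus.

```python
# A minimal sketch of the two text-mining approaches described above:
# (1) a synonym lookup, versus (2) a trained, non-generative classifier.
# The tiny labeled corpus here is invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Approach 1: search for synonyms of "mirth" or "joy".
HAPPY_WORDS = {"happy", "joy", "joyful", "mirth", "delighted", "elated"}

def synonym_hit(sentence: str) -> bool:
    return bool(HAPPY_WORDS & set(sentence.lower().split()))

# Approach 2: train a model on labeled examples. With a realistically
# sized training corpus, it can recognize happy characters even when no
# synonym of "joy" appears in the sentence.
sentences = [
    "She laughed and danced all evening.",        # happy, no keyword
    "He was delighted with the letter.",          # happy, keyword
    "The rain fell on the silent, empty house.",  # not happy
    "She wept alone in the dark corridor.",       # not happy
]
labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, labels)

test = "The children beamed as the music played."
print("synonym search finds it:", synonym_hit(test))    # False: no keyword
print("classifier's prediction:", model.predict([test])[0])
```

The classifier never re-expresses the underlying texts to the public; it only emits a label, which is the non-expressive, detection-style use the court cases described above have treated as fair.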
If content creators were able to opt out of fair use, little content would be freely available to build upon. Uniquely allowing fair use opt-outs only in the context of AI training would be a particular threat for research and education, because fair use in these contexts is already becoming an out-of-reach luxury even for the wealthiest institutions. And I'll explain what I mean. As some of you may know, in the United States the prospect of contractual override means that although fair use is statutorily provided for, private parties like publishers contract around fair use by requiring libraries to negotiate for otherwise lawful activities, such as conducting text and data mining or training AI for research. This landscape is particularly detrimental for text and data mining research methodologies, because TDM, text and data mining, often requires use of massive data sets with works from many publishers, including copyright owners who cannot be identified or who are unwilling to grant such licenses. So if the Copyright Office or Congress were to enable rights holders to opt out of having their works fairly used for training AI, then academic institutions and scholars would face even greater hurdles in licensing content for research. Rights holders could opt out of allowing their works to be used for AI training fair uses, and then turn right back around and charge AI usage fees to scholars or libraries, essentially licensing back fair uses for research. These scenarios would impede scholarship by or for research teams who lack grant and institutional funds to cover these additional licensing expenses. It could penalize research in or about underfunded disciplines or geographical regions, and result in bias as to the topics or regions that can be studied. Training AI is not without risk. I am not opposed to ethics, and to address these risks there should be adoption of best practices, private ordering and other regulations governing issues like privacy, ethics, and the right of publicity, which governs using people's voices, images or personas. So that would speak to Cliff's concern yesterday about reanimating your deceased relatives. All of this should be done, and particularly so in the commercial context. But from a copyright perspective, fair use in AI training should be preserved without the opportunity to opt out, and even more so in the nonprofit, scholarly and educational context. Which brings me quickly to a third and final point I'd like to make, which is that merely preserving fair use rights for AI training is not the end of the story in protecting scholarly inquiry. So long as the United States permits contractual override of fair uses, we will continue to be at the mercy of publishers aggregating and controlling what may be done with the scholarly record, even if authors dedicate their content to the public domain or apply a Creative Commons license to it. So in my view, the real work that should be done is pursuing legislative or regulatory arrangements like those of the approximately 40 other countries that have curtailed the ability of contracts to abrogate fair use rights within the nonprofit, scholarly and educational context. I think this is our challenging but important mission. So I think you can hear a clear perspective there. Hi, well, thanks very much, Cliff, for the invitation. I'm glad you chose alphabetical order, and I'm very glad that Richard comes after Rachael, because I was able to check what I was going to say in case any of it was wrong. So my name's Richard Sever.
I'm Assistant Director of Cold Spring Harbor Laboratory Press at Cold Spring Harbor Lab in New York, and I'm co-founder of bioRxiv and medRxiv, which are preprint servers for the life sciences and health sciences respectively. Over the past 10 years we've put a quarter of a million preprints online for hundreds of thousands of authors, and obviously the aim is to get the information out as quickly as possible. It's free for authors and free for readers, so the goal is the sort of universal access that Cliff described, and I'm increasingly inclined to use the word universal rather than open, for reasons that might become apparent later. The goal of bioRxiv and medRxiv is to speed up science by making these things free for everybody to read, as soon as possible, because we think we'll get a geometric effect that will speed up science; we kind of saw a bit of that in the pandemic. Part of that goal is that they can be available to be read by machines as well as people right away, and we have a variety of tools to enable that, including a big XML dump that's separate from the website, expressly for the purpose of text and data mining. We view that as falling under fair use in the United States, and we make sure authors explicitly consent, just to be sure, because there are people who argue that case. We want to keep authors in control; part of the goal is to be a service for authors, and so we view AI use, the non-generative forms, as falling basically within text and data mining, and so again under fair use. But I do think it's important to be cautious about this and discuss it further, because there have been well-intentioned movements in the past in the open space that have backfired and had some unintended consequences, so we want to be a little bit wary of that and make sure that we don't pursue things that are not aligned with authors' interests and stop them sharing in the first place. As far as what we've been doing, one of the things we've done at bioRxiv is dipping our toes in the water with AI by partnering with a startup out of the University of Maryland called ScienceCast to create expert and lay summaries of papers, with the idea of increasing the accessibility of what is, after all, incredibly arcane scientific information that you often need at least a PhD to understand. One of the things it's clear AI can do very well is create lay summaries of text, and we've been experimenting with that, and we'll shortly be introducing an "ask a question of the paper" type feature that was mentioned in one of the earlier presentations. But there's a bit of a cautionary tale there, because lots of people have applauded us for that, but one or two authors didn't like the summaries that were generated. This was kind of a good teething problem to have; it's good sometimes when you have these problems because it alerts you to things. But they got very angry; they didn't feel it was an effective summary of their work. I should point out, as an editor, that I have seen on millions of occasions authors who think that human summarizers of their works have not summarized them particularly well, so that says something about authors, but it also says something about their concerns, and I think we should bear that in mind if we want to carry authors with us on that journey. Okay, so let me start with the first question, and that is: what, in your view, should the goal be here?
Should it be to maximize the ability to train on and compute on scholarly corpora, or should it be something else? I'd note that there are some interesting proposals floating around about what something else might be. For example, in November we had a letter to the editor in Nature Human Behaviour titled "Generative AI poses ethical challenges for open science," and at least one way to read the proposals in there is that maybe we should be vetting the purposes to which AI systems being trained on open papers might be put, before we agree to allow them to be trained. Now, that at least to me sounds pretty complicated, and maybe like a pretty deep rat hole, but there certainly are people who have, for example, argued that training generative AI systems is an extractive process in a certain sense, and particularly if this is going to be the territory of a few already very powerful and wealthy companies, maybe we shouldn't be feeding the machine. So I throw those alternatives out not so much to advocate for them, but to say there does seem to be, at least in the broad scholarly community, some range of opinions there. So I wanna ask each of our panelists: where do they see the ideal future state heading? Are we going alphabetically again? Oh, we'll go this way on this one, and the other way on the next question. So let's see, where to start? I come from an institution, the MIT Libraries, that has always been a very loud and a very vocal advocate for open access, and I definitely draw inspiration from, and have learned so much from, my colleagues who are doing this work: Chris Bourg, Erin Stahlberg, who's somewhere in the audience, and many, many others. And throughout all of that work we've always known that open benefits everybody. It benefits scholars, it benefits people, it benefits corporations as well, and it's not been our role to gatekeep how the information is being used; it's just to ensure that it's freely available. It's very similar to open source software: so much of industry is built on open source software, and that fuels innovation, and that's, in our opinion, a good thing. We can't gatekeep innovation, and we can't claim to be the only holders of good intent, so really we just focus on open, and that's our starting point. But on top of that, I'm really with Rachael that regulation is key, and that's something that I think calls for all of our focus. We can look to the recently announced passage of the EU AI Act, which I think is really groundbreaking legislation that we would do well within the US to emulate, learn from and try to pass. And I also want to call out some good work that has been coming out of colleagues at MIT. They also released, just this week, policy guidelines for pro-worker AI, outlining a policy agenda for ensuring that AI is regulated in a way that really benefits workers and people and individuals, and not just serving to further enrich already enriched corporations or individuals. Some of the policy points that they stress are ensuring that AI regulation limits surveillance of workers, and ensuring that we're investing in what they call human-complementary technology: not technology that replaces human workers, but technology that augments and collaborates with humans to increase our capacity for creativity and increase the potential that we hold to expand human knowledge. And then finally, technology regulation that prioritizes social wellbeing.
So how can we regulate the use of AI in a way that contributes to the public good and contributes to social wellbeing? I encourage folks to look up that policy recommendation. I believe if you just Google "pro-worker MIT AI policy," something like that, you'll get it. You can also email me; I'll send it to you. And then finally, I also wanted to highlight something that came up in the previous session, the NSF session. There's a lot of really incredible work coming out of the Global Indigenous Data Alliance around the CARE principles for open data. I know that we all have deep familiarity with the FAIR principles, but I think CARE deserves our attention in a really profound way. If you're not already familiar with CARE, it stands for Collective benefit, Authority to control, Responsibility, and Ethics. And it's a framework intended to ensure that when data is used by anyone, corporations, institutions, et cetera, it is used in a way that benefits the people who contributed to the creation of that data. So if there are two things I'd wanna identify as goals, in addition to all of the incredible work that is already happening, it's uplifting the CARE principles, uplifting the work of Indigenous data scholars, and really focusing on pro-worker and human-centric AI regulation. I have the good fortune of getting to agree with Heather on everything, so we can just partition the room. I completely agree that scholars need an open environment for what they wanna study, and it's not our role to gatekeep what and how scholars study. We should also just acknowledge that people's works are already being used in ways that they cannot control. That is part of creating something. People are already studying things and having really bad takes and analyses, with or without AI. Teachers get up all the time in classrooms and say really dumb things to students and perpetuate misinformation. Yes, AI does that at greater scale, but when something is out in the world, it's really difficult to control how it's used, and merely slapping a restriction on the scholarly record doesn't mean that companies can't or won't use it. But although we can't gatekeep, there are a couple of approaches, honestly the ones Heather has described, that could be considered within nonprofit education and research. And to add a little bit of a framework to what she said, I would call MIT's pro-worker policy an example of private ordering, alongside regulation. Another example of private ordering is what we do at the UC Berkeley Library under what are called our Responsible Access Workflows. We created these to determine what within our collections we actually are comfortable, both from a legal perspective, with copyright and contracts and privacy, and from an ethical perspective, digitizing and making available to the world. Our ethical principles actually apply an ethics of care. I swear Heather and I did not talk about this beforehand, but you can take a look at our ethics policies online. Again, this is private ordering: we as a library are deciding what we are going to make easier for the world to use. Now, we have all of these materials in our collections, and it is up to scholars to decide what they want to do with them. We're not controlling their access to it, but we as an institution can make decisions about how we want to use those materials.
And these principles have turned out to be very readily replicable for the colleagues that we teach in journalism who are doing interviews, or anyone doing ethnographic studies, because they are practical ways to bring in ethical considerations that are not mandated by law, but give a voice to the content creators who may not have intended their work to be used in the way that researchers want to. And of course, regulations. Heather mentioned yesterday's, or Friday's, passage of the EU regulations, and I just want to say one word about that. Those regulations on what needs to be disclosed regarding LLMs do not apply to research organizations and cultural heritage institutions, because those are governed by Article 3 of the Digital Single Market Directive. Those entities can pursue text and data mining, and now AI training, without contractual override and without these limitations. But in the commercial sphere, rights holders can opt out of having their works used for text and data mining, and now, it looks like, also opt out of having their works used for AI training, because those uses are governed by Article 4 of the Digital Single Market Directive. So the US, as Heather mentioned, can decide whether it wants to follow the EU's lead here. So I'm going to be really controversial and just agree with everything that was just said. I won't talk for very long, but there are a few things that spring to my mind. Obviously I do think we want to maximize the potential for scholarly content to be training sets for AI. I'd much rather that scholarly content, rather than the New York Post, was the basis for the answers I get in ChatGPT. But a couple of analogies sprang to mind when I thought about this question, about things that are inevitable. I mean, it feels like it's inevitable. Google won the internet, AOL didn't. So trying to close things off and say that you can't do stuff with them just seems naive. I think it's going in that direction anyway, but this is where regulation comes in, and one has to think pretty carefully about that. I know in the gene sequencing world it's interesting when people think about this: there are lots of concerns about privacy around your genetic sequence, and you can get it from anybody's hair. So the conversation should really be downstream, about what people can do with it, accepting that the upstream stuff is inevitable. And this feels kind of similar. So it feels like the regulations are needed at some level, and Rachael can obviously speak in a much more informed way about that than I can. But one thing that struck me is the need for equity and ethical approaches to this, and I think we really ought to think about infrastructure as well. There was an article by Sylvie Delacroix from the Birmingham Law School in the UK in which she said the right analogy for thinking about a lot of stuff to do with AI is water rights. The system of water rights in a country is very carefully thought out, so that the person who builds upstream of everybody else doesn't get everything. So it's not really about easy, open, free, completely democratic access; it's thinking about the infrastructure you need to build to ensure that this can't just be taken over by the one group who has the most money. I think this is kind of what Rachael was getting at earlier.
And I remember a conversation with a colleague of mine and one of the partners we'd been working with on AI, and my colleague said to the developer, oh, you know, so did you build the training set for this AI tool? And the response from the developer was, of course we didn't; only billion-dollar companies can do this. So we don't want to have a situation where only billion-dollar companies can do this, and anybody who wants to use the tools is critically dependent on them. Thank you all. Those are really interesting responses. And one thing I want to underscore here is that all of these responses remind us that controlling the activity of training, and what can be used for training, is far from the only lever in the world of trying to put these systems to good use and good purpose. There is a whole regulatory approach that can be taken quite independent of these questions about copyright and training data, which I think is a really important meta point out of all three responses. Let me move on to the second question. Should we be thinking about the scholarly literature differently than the broader sphere of published works: things like novels, mass market books, entertainment? If we do that, how do we think about where to draw the lines? And perhaps you may have a particularly interesting view on this, Richard, as the founder of a set of preprint archives. Should we be thinking differently about different parts of the sphere of scholarly literature? For example, peer reviewed material versus un-peer-reviewed preprints that may, under peer review, turn out to be inaccurate. Do we really want those in the training sets? Why don't you start on that set of questions? So yeah, the short answer is that I do think we should be thinking about it, but it's really, really hard. Somebody in one of the earlier sessions was talking about, I think it was Semantic Scholar, and it was trained on scholarly content, and the phrase used was "however you define that." And really, I have no idea. The one thing I would say right from the get-go is that the idea that you would distinguish between non-peer-reviewed and peer-reviewed content is really deluded, because the allegedly peer-reviewed content is so polluted with things that are worse than the non-peer-reviewed content. I mean, we went through this with medRxiv and bioRxiv during COVID, and my flippant remark, when people said there might be some misinformation on medRxiv, was to say that the papers that said COVID came from outer space and from snakes weren't on bioRxiv; they were in peer reviewed journals. So obviously that's a bit of a joke, but it's a continuum. I mean, everybody knows about predatory journals, but one of the real problems is the quasi-predatory ones: are they predatory or not, and people have different answers, you know, delisting from Web of Science. So I kind of defy anybody to be able to draw that distinction. And if you then say, well, then you're gonna allow preprints in, then which preprints? Is it things like arXiv and bioRxiv? I don't know if anybody's ever looked at the viXra preprint server, which is arXiv backwards; it's basically all the lunacy that arXiv doesn't let in. Somebody decided that, you know, there's a horrible cartel of physicists running arXiv, so you can put your stuff here, and it really is lunacy.
There are data repositories that take preprints, like Zenodo; if you go and do a search for things like the origins of COVID on Zenodo, you get some really, really weird papers. So, you know, I don't really have any answers. In the medical sphere, I guess the only thing that comes close to that kind of vetted corpus would be MEDLINE, and I'm being very careful to say MEDLINE, not PubMed or PubMed Central, because there is a barrier to entry with really in-depth scrutiny to get into MEDLINE, so it's a much smaller subset of journals. I think maybe about 15% get in, and it takes years now, so it's hardly rapid. So I think we have to think about it, but I really don't have any easy answers. Maybe the answer would come from the people who build the models: can they do things like look at network effects to define a core and eliminate outliers? On this one, I mostly agree with you. I would say no, we shouldn't be thinking differently between the peer-reviewed and non-peer-reviewed in terms of training content. The restrictive licensing push already makes it difficult enough to access and use content for training because of contractual override. I think making a distinction would undermine public interest goals pursued by lawmakers. It would also create a further risk of rent seeking and anti-competitive behavior if a rights holder were able to demand additional remuneration to have their content included, or to withhold granting licenses for their content to be included. So I think platforms, as Richard said, should have good policy and they should have transparency, and the Federal Trade Commission could regulate them the same way that Europe will be doing with required disclosures, especially for commercial LLM platforms. But this is not a reason to prohibit ingestion of the content for AI, just to regulate how the tool is used or what is disclosed about it. People will then be able to make their own determinations about the utility of the tool based on those disclosures, or evaluate whether it's garbage in, garbage out. So, calling back to Cliff's question: should we be thinking differently about the scholarly literature compared to other published works, et cetera? From a purely technical perspective, not really. There are some distinctions when it comes to different data types, if we're talking about massive-scale data sets versus text versus et cetera. Where we're at right now, when you talk to folks in industry who are building large language models, they are just scraping everything they can get their hands on, indiscriminately. It might be scholarly literature, it might be the New York Post, it might be anything, because we're in such a hype cycle right now that everybody just wants to build, build, build and consume, consume, consume. And so I feel like one of the important questions for all of us is: how can we make it as easy as possible to access good open scholarly content? Because there's a desire for that.
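One route to that kind of easy, structured machine access already exists: many repositories expose their metadata over OAI-PMH, the standard harvesting protocol. Below is a minimal sketch against arXiv's public OAI interface; resumption tokens, error handling, and rate limiting are omitted, and any real harvester should follow the repository's documented terms of use.

```python
# A minimal sketch of bulk metadata harvesting over OAI-PMH, the standard
# protocol many repositories expose for exactly this kind of machine access.
# A real harvester must also follow resumption tokens to page through the
# full set, respect rate limits, and honor each repository's terms.
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

resp = requests.get(
    "http://export.arxiv.org/oai2",
    params={"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": "cs"},
    timeout=60,
)
resp.raise_for_status()

root = ET.fromstring(resp.content)
for record in root.iter(f"{OAI}record"):
    title = record.find(f".//{DC}title")
    rights = record.find(f".//{DC}rights")
    if title is not None and title.text:
        print(title.text.strip())
    if rights is not None and rights.text:
        print("  rights:", rights.text.strip())
```

Because the rights statement travels with each record, a harvester built this way can filter a corpus by license up front rather than scraping indiscriminately.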
We've spoken to research teams at MIT, we've spoken to folks who have spun off startups from MIT, and that is what they want, but it's difficult to get, or it's more onerous than just scraping up whatever can be accessed on the internet. And that's the cause of a lot of the problems we're seeing right now, especially in generative AI models: garbage in, garbage out, and right now garbage is really easy to get. So how can we think about ways to make it easier to quickly and efficiently consume, not just one repository's scholarly content, not just a consortium's scholarly content, but the scholarly record? That, I think, is a limiting factor right now that we in this room are particularly well positioned to tackle. So that's a fascinating set of responses, thank you. I was going to ask about distinguishing text mining from AI training, but I think we've actually pretty much covered that issue already. So I'd like to jump to the last area I wanted to ask about before opening it up to the audience, and that's really about pragmatics and scale. We've already touched on a couple of pieces of this, at least implicitly, for example what you were just saying about making it easy to get good quality material at scale. I wonder if I could get each of you to reflect a little on the kind of infrastructure we need, and the kind of choices we should be making, to allow machine learning use of the scholarly record to scale effectively and operate as broadly as possible. For example, what can we do to make it really easy to get large amounts of high quality text, and to be transparent about, say, what was in the training set used to train a given model? Why don't we start with Rachael this time, and then we'll go around this way. Sure. So I'll just address the rights-associated aspects of that question in terms of infrastructure. What we've heard over the course of this conference is some folks considering whether perhaps we need a new Creative Commons license to facilitate better training, and also better attribution, to Cliff's question just now. So I'll say first that Creative Commons licenses do not override exceptions like fair use. When you are making a fair use in training AI, you don't have to worry about the Creative Commons license anyway, so we don't have to make changes to the Creative Commons licenses in order to promote fair use, because fair use exists beyond the license. In terms of questions around attribution and disclosure of the content being used, I think from an infrastructural perspective we could think, as the EU has done from a regulatory perspective, at least for commercial entities, about disclosing more about the content. But we should already feel some comfort in the nonprofit educational sphere that Creative Commons licenses already afford a degree of, not luxury, but reasonableness in how we identify the works being used, from an attribution perspective. They allow attribution in any manner that's reasonable, and maybe that is a separate text file that lists the sources; it doesn't have to be an identification on every single item feeding the training tool. So I'm not advocating for changes to Creative Commons as the infrastructure. Where I do see a role, especially for libraries, in a facilitative infrastructure is around education.
Right now we are in a situation in which the onus is on researchers to determine whether or not what they wanna do is lawful or ethical, and we know with certainty that that uncertainty alone chills their research. Some of you who came to our project update yesterday on the Building Legal Literacies for Text Data Mining project and its cross-border follow-on will have heard that all of that uncertainty, because it's not part of training and education for researchers, is impeding their research and in many cases preventing it entirely. So as libraries, our role is to educate about the current legal and regulatory landscapes and help researchers make choices that work for them, but our role is also to do the advocacy work of ensuring that we get good laws and regulations that support the kind of research they wanna do. And I think one critical way we can do that is by working with our offices of legal affairs and our institutional review boards to start developing university or institutional policies that will defend researchers who make a good faith effort to comply with the various copyright or privacy laws and ethical policies in their research, because that takes the pressure off of the scholars. If they understand, look, if I do the things that the law requires or that my institution expects, they've got my back and I can move forward with this research, I think that will support more learning, more understanding and a better scholarly record. I'll go to Richard next. So yeah, I would really like to echo that point about fair use and the fact that this is achievable with fair use. I was struck by something that the journalist Richard Poynder, who's been covering the OA space, said recently in explaining his decision to no longer cover it, because he claimed the movement had failed. One can argue about everything he said, but one interesting thing he said was that he felt the focus on CC BY had been a misstep. It's interesting to me, and I feel that there is quite a lot of zealotry around that, and it's a bit of a shame when that comes at the expense of fair use, because one of the things about fair use is that it can be applied to existing published copyrighted content, which is never going to be CC BY. So if you can train AI on all that content that's never going to be CC BY, on the basis of fair use, then that seems like a really big win, and so I think there's some education that needs to be done there, and we really should push in the direction of having people understand fair use. And also, outside the US there are jurisdictions where people say, oh well, that's only in the US, it doesn't apply in our country, so more global efforts on that front would be great. Coupled with that, we can build places where there is easily mineable content. arXiv and bioRxiv and medRxiv all have text and data mining repositories, and that content falls under a wide variety of licenses. On arXiv, and I've nothing to do with arXiv, but I think 99% of that content is under the arXiv license, which is far more restrictive than CC BY, but arXiv is at pains to point out that nevertheless it is all available for text and data mining and, you know, AI training. The same would be true of bioRxiv, where only about 20% of authors choose CC BY, I think; more of them like things like CC BY-NC. But again, we say 100% of the content can be used for this type of training.
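The "separate text file that lists the sources" that Rachael mentions as reasonable attribution is straightforward to operationalize. Below is a hypothetical sketch of a training-corpus provenance manifest; the schema and field names are invented for illustration and do not follow any existing standard.

```python
# A hypothetical provenance manifest for a training corpus, along the lines
# of the "separate text file that lists the sources" discussed above. The
# schema is invented for illustration; it is not an existing standard.
import json

manifest = {
    "corpus": "example-scholarly-corpus-v1",   # hypothetical corpus name
    "created": "2023-12-12",
    "sources": [
        {
            "id": "10.1101/2023.01.01.000001",  # e.g., a preprint DOI
            "repository": "bioRxiv",
            "license": "CC-BY-4.0",
            "basis_for_use": "license",
        },
        {
            "id": "arXiv:2301.00001",           # hypothetical identifier
            "repository": "arXiv",
            "license": "arXiv non-exclusive license",
            "basis_for_use": "fair use (TDM / AI training)",
        },
    ],
}

with open("training_manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)

# Downstream users can then see what went into the model and on what
# legal basis, supporting the transparency the panel calls for.
```

Recording the legal basis per item alongside the license makes the manifest useful both for attribution and for the kind of disclosure the EU rules contemplate.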
You've the last word on this, Heather. Thinking from the infrastructure perspective, it's really a question of scale. It's the classic question of scaling up, and I think that's the space that we're in right now, and we're already doing so many things right. As I said in the intro, so many of the works and projects presented over the course of these two days are about partnership and collaboration and large-scale access to shared resources, and that's exactly the sweet spot. When it comes to artificial intelligence, it's an exponential jump in scale, and so I would encourage all of us to also look towards partnerships with the research computing units on our campuses, with the folks who are doing high performance computing and are really in the guts of the infrastructure that enables data processing to happen at the speed of artificial intelligence, and to not lose sight of our strengths, particularly as librarians, in facilitating and nurturing and growing those types of connections. I think that's one of our greatest strengths as a profession: we're really focused on enabling access to shared information, building connections across many different institutions and organizations, and really eschewing competition. That's an area where we have a leg up. What else did I jot down while everybody was responding? Something that always sticks with me is a point made in the book Power and Progress, Daron Acemoglu and Simon Johnson's thousand-year history of technological innovation, which just came out a couple of months ago. The authors look to the industrial revolution to understand the scale of the technological revolution that we're currently in: it's not just a question of text and data mining or scraping; we're looking at a multi-decade transformation on the scale of the industrial revolution. And that puts us into a response mode, if that makes sense, where we're seeing a lot of corporatization of information. I know that a lot of folks in the room share concerns about that, and that is an issue that takes several decades to respond to. So it's really worth a read, Power and Progress, because they look at the labor movement as a response to the industrial revolution to map out our response to this information revolution that we're currently undergoing. And I think that's another really critical role we have to play as folks in the information ecosystem: thinking again about how we can advocate, not only to scale up, not only to meet the demand, not only to keep pace with development and access to information, but to really keep the focus on the human at the center of AI and the human at the center of data, and advocate for systems, applications, policies and governance that ensure this technology contributes to the public good. And then I'll close with a nod to another colleague at MIT, Josh Bennett, whom I just got to hear speak last week at an AI and creativity summit, and I was really inspired by what he had to share. He talked about looking at prior art to inform how we interact with AI and how we respond to all the issues and complications we've talked about today, and he held up sampling in rap music as prior art we can look to. Depending on how it's done, it can come across as homage or it can come across as ripping off somebody's prior work. How do we look at the ways that sampling is successful, that sampling is a celebration, an homage of all of the different work that is synthesized together to create a new creative
work? And so, Josh Bennett; I'm citing him for that, and I think that's what I'll leave everybody with. Thank you for all of those insights, wonderful. At this point we have a little over 10 minutes, and I'd like to open the floor to questions and observations from our audience here. I believe there are microphones there and there, please. Thank you, Cliff, and all of the panelists. Lisa Hinchliffe, University of Illinois at Urbana-Champaign. We began with, I think, a really provocative framing from you, Cliff: that in spite of our seemingly collective view that it is fair use to train AI models, we actually know there are quite a few authors who are not feeling that way and are filing lawsuits, so far unsuccessfully. But Rachael, I was particularly taken that you were very careful to set aside generative AI, and particularly outputs, and whether those could be infringing. And so, realizing that at this point one would be speculating: first, might outputs be infringing? Secondly, with respect to the market, is it the market of prior work, the inputs, rather than the artist's ability, if you will, to make a living from future works? I could be wrong on this, so if you can unpack that a little. And then, with that, would there be any liability for us if we were making content available where the outputs would be infringing? Thank you. Might Lisa hang out up there, because you asked three questions, so I might need you to repeat the third. No, no, that's okay. Okay, first question: inputs versus outputs. While it may be speculative, I don't know any legal scholar I respect who thinks that courts would find the training of AI to be anything other than a fair use, whether it is generative or non-generative, because of the transformativeness under factor one and, under factor four, the non-expressive use that neither supplants the market for the original nor communicates it to the public. Could something in a generative context create an output that infringes? Yes, potentially. That would be determined on a case-by-case basis, and, as to when it's more likely, I highly encourage everyone to read a couple of recent papers by Matt Sag; he's at Emory. He explains the kinds of contexts in which that could be more likely, and the kinds of best practices we could adopt to avoid it, and those would be situations where a training set includes multiple, like, many copies of the same item. In terms of what market is considered for purposes of factor four: it is the market for the individual work at issue, not the creator's entire body of work or their livelihood or things they create in the future. Does it replace or supplant the market for that individual work? In a non-generative context, it's hard to imagine how that could even be possible. In a generative context, it could be possible on a case-by-case basis, while balancing the four fair use factors. Which leads to: if it could be, then is there potential liability if one has made those materials available? If what you are creating, if what you are distributing, is the underlying work or something that reproduces the underlying work, potentially; but in a non-generative context I see no real opportunity for that. Thank you. I realize these are really complex issues, and I have, as a human intelligence, no ability to attribute all the things that have gone into my asking those questions. Can I just ask: would you think that in that scenario, with the generative AI, authors might reach for CC BY-ND licenses to try and head that off?
They might, but that would be indicative of the lack of education, because a CC BY-NC-ND license cannot actually prohibit the content being used for training... No, I just meant for the generative aspect. For the output. For that individual output. Oh, with the... Okay, so, I mean, potentially, yeah. So, I'm sorry: Steve Weida, Yale University. I woke up last night in a sweat in my hotel room, and I hope it's not just because of much too much dime a tap, but I had three letters go through my head that everybody used to be really hot to trot about: SEO, search engine optimization. And you've kind of touched on this, but I don't really hear anybody talking en masse about GPT optimization. How do you float your results and get your content noticed by the GPTs and the large language models, so that it comes up? And clearly there's a commercial aspect of this too. Sponsored results: hey, I'm hungry tonight; wouldn't you like a fine McDonald's hamburger? So I'm just wondering if there's been any thought around how to make your content, any content, more desirable and more usable for these engines, so that they're most likely to go to it. Because I don't think anybody eventually is going to care that they're the number one Google result; they're gonna care that they're the number one GPT response. I kind of have a follow-up question to that, because to me it sort of depends on the technology. With SEO, as I understand it, what you're trying to do is guess what the Google algorithm does: not knowing it, but having quite a good idea of what it is, because you know it's an algorithm designed by a Google engineer. But with lots of AI, even the builders don't know what's under the hood. So, for somebody who's more technical than me: how doable is it, if even the designer doesn't know how it's optimizing? I'll just note that there is actually a vast literature on adversarial machine learning, and basically on how to pollute various models with small amounts of mislabeled training data, things of that nature. Now, that's not quite the same thing as search engine optimization, in particular because SEO is typically trying to manipulate where you show in a ranking. And if you look at a lot of these ChatGPT-type things, they're actually doing a sort of stochastic generation, so that runs kind of counter to the notion of where you show in a ranking. But that's a place you might at least start looking. I think you may have named a new discipline of marketing that's going to employ a whole host of people over the next couple of decades. So, in the introduction you talked a little bit about comfort and consensus: the things we have consensus about and feel comfortable with, and the things we may not feel comfortable with. I was thinking that we didn't feel comfortable about the first set of things at the beginning of those endeavors either; we have gotten comfortable over a period of time. And still, in our MIT advocacy for open, I talk to plenty of people who are not supportive of open in general, even before generative AI. So I'm thinking about this in the change management context too. And I know, because I work with Heather, that she's also done a lot of thinking about technological transformation and what people have worried about in the past when other technologies were introduced. With every technology change, people worry about their jobs. That's a pretty consistent through line.
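As a toy illustration of the adversarial machine learning point raised above, here is a minimal sketch of training-data poisoning, where relabeling a targeted fraction of training examples degrades a simple classifier. All data here is synthetic, and real attacks and defenses are considerably more sophisticated.

```python
# A toy illustration of training-data poisoning, the adversarial-ML issue
# noted above: mislabeling a targeted slice of the training data can
# measurably degrade a model. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two overlapping synthetic classes separated along the first coordinate.
X = np.vstack([rng.normal(-1, 1, (500, 2)), rng.normal(1, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clean = LogisticRegression().fit(X_train, y_train)
print("clean accuracy:   ", clean.score(X_test, y_test))

# Targeted poisoning: relabel the 20% of class-0 training points that sit
# closest to class 1, dragging the learned boundary into class-0 territory.
idx0 = np.where(y_train == 0)[0]
k = len(idx0) // 5
flip = idx0[np.argsort(X_train[idx0, 0])[-k:]]
y_poisoned = y_train.copy()
y_poisoned[flip] = 1

poisoned = LogisticRegression().fit(X_train, y_poisoned)
print("poisoned accuracy:", poisoned.score(X_test, y_test))  # compare scores
```

As Cliff notes, this ranking-and-decision-boundary style of manipulation does not map neatly onto the stochastic generation that chat-style models perform, but it is where the existing literature starts.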
And so it's not surprising that we see that now with generative AI. So I'm wondering, and I actually just want to recall really quickly a really excellent question Lisa asked yesterday about what we know about trust and what helps people trust. Today I'm thinking: what helps people have comfort? So I'm wondering if you all could speculate a little on what attributes would need to be in place for people to feel more comfortable with generative AI, or whether that might be hard, or speculate on what other disciplines and experts we need to be talking to, to understand what contributes to trust, what contributes to comfort with technology, and how we might think about this as a change management trajectory. Happy to jump in first. Like I said at the outset, I'm an optimist; I'm a glass-half-full person. So the first word that comes to mind is experimentation. Try out the tools; experiment. That's one of the beauties of the new crop of generative AI tools: they're very accessible to all of us, and they have essentially unlimited application. Erin and I went to the AI and Creativity Summit at MIT, I think last week or the week before, and there was a student lightning talk session where they were talking about the different emerging applications of AI. And I left thinking, okay, the question is not how can we apply generative AI to libraries, because the answer is everything. Every single application that you can think of is currently being developed and in progress: everything from using voice activation to speak objects into existence, through technology that takes your words and translates them into commands for 3D printers. So you had people saying, make a mug for me, and then it would print a mug. Erin, I think the quote I wrote down from one of the students was, can we bake bread and a shoe? And it turns out, using AI, they could. I'm not even sure what that means. So I think, given where we're at in this particular transformative moment: experiment, see what these technologies can do, and think not how can this replace the work that I am doing, but how can this enable me to do more, or different, or more creative work. I think a lot of answers live in there, and I've seen it in myself, in colleagues, in others, that once you start playing with these tools, and thinking of it as play and thinking of it as creativity, they start to become so much more accessible. Yeah, I would add to that. I think the key word there is tool, and, in my experience, scientists use tools, and a tool that you use is different from a tool that is used to replace you. I'm conscious of the fact that I had a conversation with a friend of mine who runs a music production company, and he's at that age where he's thinking he's gonna quit, because basically nobody's gonna pay him to make music for TV shows anymore; they'll just use AI. That's a very different conversation from the one with the scientists I know who are now using AI in the lab. They see it as another tool set, something you move on to. In structural biology, there are lots of people who used to do X-ray crystallography and now do cryo-electron microscopy. It's not that they're getting fired and nobody wants them; it's that everybody who used to do one is now doing the other.
So I think it's a good position to be in, to be a scientist, because it's a tool you use rather than a tool that replaces you. The conversation we're having here is very different from the conversation that was going on among writers in Hollywood or in the music industry. That's a really good insight, thank you. Did you want a final word on that? Yeah, it just wouldn't be fair otherwise. Okay, so I'm gonna focus on the aspect of the question around trust: how can we build trust? And I think, tying into what Heather and I were saying about private ordering policies: if you have transparency about your policies, then that can help build trust with community members. So, for example, I mentioned our Responsible Access Workflows. We also have what's called a community engagement policy, which says, hey, this is how we made all of these decisions; if you have questions about it, or you think we made the wrong decision knowing what you now know about how we went about the process, reach out to us and we can talk about it. If you have more information, we can reapply our policy and see if we reach a different outcome. It's that kind of transparency in our policies that makes that trust possible. I would also take a harder line with respect to copyright. At this point, as far as I believe, there is no new law needed to placate the concerns of content creators, because the substantial similarity test for infringement, which would govern the output of generative AI, covers their concerns. If there's infringement, if there's substantial similarity, and there's no excuse or reason why that infringement should be permitted, then there's the potential for damages. But I would say that regulation of other, non-copyright aspects can also help build trust: ensuring that we have good regulations around privacy and how materials can be used, or, at least from a commercial perspective, as is being done in the EU, disclosure of the materials that are going in, and the other considerations that commercial entities have put in place. Thank you. Good points, all. We are, alas, out of time. We may be out of questions as well, so perhaps that's a happy confluence. Please join me in thanking this wonderful set of panelists. Our December 2023 meeting is closed. I wish you good holidays, safe travels home, and I look forward to seeing you in '24. Thank you for joining us.