Hello everybody, good morning Wikimania. I'm Sean, and this is a panel about AI and the challenges and opportunities it presents to the movement, and particularly about specific projects that the Foundation, chapters, and others are conducting, so practical kinds of concerns. There is going to be a panel in about an hour and 15 minutes where they'll be talking about more philosophical, vision-oriented ideas around what AI means more broadly. This will be a more specific panel about what is actually going on on the ground in terms of products, features, and concerns. So I have three great panelists. You can introduce yourselves.

Hi, I'm Denny Vrandečić. I've been a Wikipedian for 20 years now. I worked on Wikidata previously and am now working on Wikifunctions and Abstract Wikipedia.

Hi, I'm Lydia Pintscher. I'm the portfolio lead for Wikidata and have been with Wikimedia, and specifically Wikidata, for 11 years now.

My name is Leila Zia. I'm the head of research at the Wikimedia Foundation. For those of you who may not know about the research team, it's the team that focuses on building models and insights using scientific methods, and also on strengthening the research community around the Wikimedia projects. I bring to this panel some perspectives anchored in the team's work on generative AI and large language models, as well as some of the work we are doing with neighboring teams in the Wikimedia Foundation, and I'll share some of the perspectives and work that is happening at the moment.

And I am Sean Spaulding. I'm lead counsel at the Wikimedia Foundation, so on the legal team I interface a lot with the Future Audiences team that's thinking about AI, and I do some product counseling with other teams, and that's the perspective I bring. So many of the topics that we're going to talk about, we could talk about for an entire hour.
So I just wanted to let you know in advance that we're going to be making a handful of simplifications. It's not always going to be straightforward whether we're talking about machine learning or other related concepts, but we're going to use the term AI broadly. One more thing: we don't represent everything that's going on in the movement. So if anyone is specifically working on anything, feel free to raise your hand and talk about it, because our perspective is just one of many perspectives going on here. Finally, many of you came here from a long way away, and we don't want this to be a lecture. So as soon as anybody has questions, feel free to raise your hand and present them so we can interact together.

All right, so there is one preamble I'd like to give, which is that the use of AI is not new to the movement, to the Foundation, or to its products. Let's talk about that. We've been using bots since the beginning, and there's been a machine learning team at the Foundation since 2017 doing things like machine translation. So maybe everyone can start by saying: how have these tools we've been creating evolved over time, and how have they met the projects' needs so far?

Sure, I can start. Maybe I'll first say that if you're interested in getting an overview of the many tools and technologies that are currently being developed, I encourage you to attend a session at 12:15 in the Plenary Hall. Santosh, who is sitting in the middle of the room, is running that session, so you will get a deeper view into what's happening on the projects with regards to machine learning and AI. I can share a few examples of things that have been happening at the Wikimedia Foundation, for those of you who are interested in bringing newcomers, new editors, to the movement.
Some of you are familiar with the newcomer dashboard. It's a dashboard developed by the Growth team at the Wikimedia Foundation, and part of it centers around the idea that we should engage newcomers by giving them structured tasks, because we sometimes give them too many things at once and it's not clear to them what they should be working on. The Growth team worked with the machine learning team and the research team at the Foundation, and we developed a machine learning model called Add a Link: given an existing Wikipedia article, the algorithm recommends which hyperlinks can be added to it. So it's an example of machine learning being used at the Foundation, in products and features that users interact with. The feature is available now in more than 100 languages, and there's more work being done on making the model more language-agnostic, and this is something I want to emphasize. I think we are going through some level of transformation as an organization as we realize what constraints we are working with. We used to develop machine learning models for specific languages, and now we are realizing that we have 300-plus languages just in Wikipedia, plus the other Wikimedia projects that we serve, so we are moving more and more, where it makes sense, toward language-agnostic models that allow us to reach many more languages at once. There are other examples, like revision scoring and quality scoring of articles. This is a piece of technology that was developed in 2017 and 2018 under the umbrella of ORES.
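The candidate-generation half of a feature like Add a Link can be sketched in a few lines. This is purely an illustration, not the Growth team's model (the production system ranks candidates with a trained, language-agnostic classifier); all titles and text below are invented:

```python
def recommend_links(text, known_titles, already_linked=()):
    """Toy link recommender: propose phrases in `text` that match
    existing article titles and are not linked yet.

    Only the candidate-generation step is shown; the real model
    additionally scores each candidate before suggesting it."""
    candidates = []
    lowered = text.lower()
    for title in known_titles:
        if title.lower() in lowered and title not in already_linked:
            candidates.append(title)
    return candidates

# Example: suggest links for a stub paragraph.
titles = ["Machine learning", "Wikipedia", "Stockholm"]
paragraph = "Machine learning models now support editors on Wikipedia."
print(recommend_links(paragraph, titles, already_linked=["Wikipedia"]))
# → ['Machine learning']
```

The point of the sketch is that the hard part is not finding matches but ranking them, which is where the machine learning comes in.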
We are revisiting that technology right now because we understand that we need to go to many more languages much faster than we have been, and the process is slow because we need to build training models for every language we go to. So for that piece of technology, too, we are looking at language-agnostic models. There is a model right now that does revision scoring in 47 languages, and we are testing it in more languages. So a lot of the evolution right now, or part of it, is around language-agnostic models and ways we can scale to more languages more quickly.

Great example. Anyone else have any examples of current uses of AI?

One thing I want to mention is what Sean already said: we've been using technology that is AI-like since the very beginning. One funny thing about the term AI is that AI is usually whatever we don't really understand right now. For example, technology like using templates with places to put in words, things that were called mail merge, was originally developed within AI projects back then, and we would never think of that as AI today. But this is how, for example, Rambot took the US census data and put the city data into Wikipedia in 2003 or so, and this is 20 years ago. So we've been using AI results since basically the very beginning of Wikipedia, and today's AI is a continuation of that. I'm pretty sure that we will absorb this just as well as we did before, and, what I think is much more important, that we can do so without losing sight of what's really the valuable part of Wikipedia, what is the special thing that we do. We're going to have a more philosophical session later, but I just want to point out that this is nothing entirely new to us; we have been dealing with new technologies all the time over the last two decades.
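The appeal of language-agnostic revision scoring, mentioned above, is that the features describe the shape of an edit rather than its words, so one model can serve many languages. A toy sketch, with an invented feature set and hand-picked weights (the production model is a trained classifier, not this linear scorer):

```python
def revision_features(old_text, new_text):
    """Language-agnostic features for scoring an edit: they look at
    the *shape* of the change, not the words. Feature set is
    illustrative only."""
    return {
        "chars_added": max(len(new_text) - len(old_text), 0),
        "chars_removed": max(len(old_text) - len(new_text), 0),
        "refs_added": new_text.count("<ref>") - old_text.count("<ref>"),
        "upper_ratio": sum(c.isupper() for c in new_text) / max(len(new_text), 1),
    }

def score_revision(features, weights):
    """Tiny linear scorer standing in for the trained model.
    Missing features default to a weight of zero."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

old = "Paris is the capital of France."
new = "Paris is the capital of France.<ref>INSEE 2023</ref>"
feats = revision_features(old, new)
# Hand-picked weights purely for demonstration: reward added references,
# penalize removed text.
print(score_revision(feats, {"refs_added": 1.0, "chars_removed": -0.5}))
# → 1.0
```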
Okay, so those are great examples of what's been done historically, but I think everybody's here because they understand that AI now means large language models and diffusion models. So what is currently being proposed, or already worked on, that might use those techniques?

Maybe I can start. For Wikidata there are several ideas that people are throwing around, and some of them people are actually working on. For example, when you want to use the data in Wikidata and really explore it, you probably need to know SPARQL to write a query, and that is too hard for a lot of people; there's a barrier that maybe shouldn't exist. So people have been playing around with training a large language model to take a natural-language prompt, like "give me the ten biggest cities", and translate that into a SPARQL query. Results look promising, but it's not quite there yet. Another thing I'm very excited about, but that unfortunately people haven't started working on yet (talk to me if you want to), is using a large language model to basically take some data from Wikidata, like the date of death of a recently deceased person, where we have a reference to an obituary in the New York Times, for example, and then prompt the model and ask: does this reference actually say that this person died on that date, or is that maybe not the case, so that someone should look at it and correct the reference?
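A sketch of what such a natural-language-to-SPARQL tool could look like. The query is a hand-written plausible target, not model output (on Wikidata, Q515 is "city" and P1082 is "population"); the `llm` callable is a placeholder, and the canned fallback exists only so the example runs offline:

```python
# A plausible hand-written target for "give me the ten biggest cities":
# all cities, ordered by population, top ten.
TARGET_SPARQL = """
SELECT ?city ?population WHERE {
  ?city wdt:P31 wd:Q515 ;
        wdt:P1082 ?population .
}
ORDER BY DESC(?population)
LIMIT 10
"""

def nl_to_sparql(prompt, llm=None):
    """Sketch of the proposed tool: hand the prompt to a model trained
    on (question, SPARQL) pairs. `llm` is a placeholder callable; with
    none supplied we fall back to a canned example so this runs offline."""
    if llm is not None:
        return llm(f"Translate into a Wikidata SPARQL query: {prompt}")
    canned = {"give me the ten biggest cities": TARGET_SPARQL}
    return canned.get(prompt.lower())

query = nl_to_sparql("Give me the ten biggest cities")
print("SELECT" in query)  # → True
```

In a real deployment the generated query would still be shown to the user before running, since, as noted above, the results are promising but not reliable yet.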
One place we're looking at is the Abstract Wikipedia project. In Abstract Wikipedia you will need to construct the articles in a natural-language-independent way, which probably requires quite some learning: which constructors exist, how to pull them together, and so on. This is a place where, since the project was proposed three years ago, we have been thinking that we can use an LLM, well, "LLM" is the big name, but we can use a language model, to look at natural-language input and translate it into those constructors, so that you get a head start in building the content of Abstract Wikipedia. It helps you, but leaves you in complete control of the content, because you are the one who then says "yes, this is right", or goes in and adds a bit more.

Maybe I'll start by sharing what Maryana Pinchuk would have said; she's on the panel but she couldn't connect because it's too late in San Francisco. So I'll speak a little bit to some of the initiatives being run at the Foundation. First, the Foundation decided to invest in what is called Future Audiences; you may have read about this as part of the organization's annual plan. I must emphasize that there are a lot of discussions happening right now, so by no means can we claim that we know the answer to everything, but it is clear to us that we need to continue learning, listening, and experimenting with the communities, for the communities, and also for the audiences that may not be here with us today on the Wikimedia projects, or that we may lose in the coming years. The Future Audiences initiative is partly focused on bringing new users to the Wikimedia projects, and one of the things you may have read about is the Wikipedia ChatGPT plugin. It's a plugin that you can access if you go to ChatGPT, and what it does is effectively constrain the ChatGPT space: it lets you interact with ChatGPT, but only with the Wikipedia data. It's almost like a plugin that can tell you what Wikipedia's perspective is on the question that you as a user are asking. If you have not experimented with it, I encourage you to have a look. That's one of the examples of things the Foundation has done over the past couple of months to make sure that we continue evolving, learning, and experimenting, and now we're looking at the data to understand how, and whether, we are engaging with new audiences in that space.

Another relatively large part of our attention in this coming year is going to be on the topic of text summarization. We already started with some mBART and machine learning models around text summarization. It's a piece of technology that, if we can get good at it, can have a lot of different applications. Those of you who were at Wikimania in Stockholm may remember we had a community discussion around managing and mitigating disinformation on the projects, and one of the requests that came from some of you who attended that session was about perennial sources: those discussions go on for decades, and it's very hard for someone entering the discussion on perennial sources to understand the important pieces of information they need to know from the past two decades of discussion. Of course editors right now spend time manually summarizing content, but those are the types of applications we are thinking about: where are the places where we can support editors in the work they're doing, while removing some of the burdens they're carrying? And of course text summarization can have other applications: you can think about summaries of articles, or other ways of managing and mitigating disinformation and misinformation. But that's an
area that we are investing research and experimental development resources in, to understand how we can use text summarization. And I will say the key for text summarization is not the technology, but finding the right applications where it is going to be actually useful for people on the ground.

So there's such a wide range: we talked about SPARQL and interacting with Wikidata and Wikifunctions, we talked about text summarization and the ChatGPT plugin. In the universe of every single thing we could possibly work on, why are these the things we've chosen to focus on?

Because we want to really focus on what we, as the Foundation or the chapters, can do to help the contributors, to help the humans writing Wikidata. We are completely aware that we are just a small part of the wider research and product world; there will be a lot of products built on top of Wikidata, on top of our projects, outside of what we are doing. For some of those areas, like the Future Audiences projects with ChatGPT and so on, we are diving in because we want to be able to see what's going on, to give a bit more guidance on what these areas can do, and also to check how people are interacting with it, in order to get that kind of data and feedback back. In other areas we simply don't have the resources to do everything, and we know that there will be other organizations outside, universities, companies, and so on, who will be doing a lot of the work. So we have to strategically place our bets, in the understanding that the wider world is also doing something and we don't necessarily have to replicate it. We are probably uniquely positioned to understand you, the contributors, best, and to find places where we can really help you, whereas most of the others are more focused on readers and on bringing the knowledge out to the wider world.

Just for anybody who came in recently: we'd like this to be an interactive session, so if anybody's working on anything, feel free to raise your hand and talk about it. We will repeat the questions as well.

It's not about something I'm working on, but I have a question. I was quite interested to learn about the move from language-specific models to language-agnostic models, and my question ties into this: how does the language-agnostic representation that we want for Abstract Wikipedia tie in with these kinds of models, and with the claim that some large language models already have some sort of language-agnostic representation?

Right. Google's natural-language translation models, for example, have claimed for five or six years to have some internal, language-agnostic representation. I've looked into this quite a bit, and the thing is, those representations are not exactly human-understandable, so using them as an interface for humans to contribute text and work with it, I wasn't exactly sure that this would work. For Abstract Wikipedia, one main thing is that everyone should be able to contribute content, so it must be reasonably understandable how to use it, how to apply it, and so on. But the interesting thing is, I think it is easier for an LLM to adapt to our guidance than the other way around. I mean, humans are even smarter; they could figure it out if they really needed to. But why shouldn't we focus on creating something that's really, really good for us, and then get the LLM to train on that? Then we can produce training data, we can use this to guide the LLM, and so on.
I wouldn't look too much into the internal representations those models are creating; they are often highly specific to the individual model, and whatever you learn about the current version of GPT that's out there is not necessarily connected with what the Google translation model does, for example. So I would say: let's focus on making something that's really good for us, and the models will catch up.

One thing that I would add, somewhat related, from some of the learnings about the ChatGPT plugin: people have been using it all around the world in different languages, and we've been studying statistics on how much it hallucinates in each language. We came to the conclusion very quickly that it hallucinates way more in certain specific languages than in others, and it's hard to predict even which languages that happens in. So I think it's important to think very carefully about how to roll these things out, where to roll them out, and how we talk about how accurate they are, before they become robust, universally used products.

I actually did the experiment of asking factual questions in different languages to the current version of GPT, and it was interesting to see that it has completely different knowledge bases in the different languages. Even for simple things, like the birthplace of a person, it would give me different answers depending on which language I asked in, and that doesn't bode well.

So, any other questions before we switch gears to another topic?
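The cross-language probing just described can be framed as a small harness: ask the same factual question in several languages and flag disagreement. Everything here, the `ask` callable and the fake model, is an invented stand-in for whatever system is being probed:

```python
def consistency_check(ask, question_by_lang, normalize=str.strip):
    """Ask the same factual question in several languages via
    `ask(lang, question)` and report whether the (normalized)
    answers agree. `ask` is a placeholder for a real model call."""
    answers = {lang: normalize(ask(lang, q))
               for lang, q in question_by_lang.items()}
    consistent = len(set(answers.values())) == 1
    return answers, consistent

# Fake model that, like the experiment described above, answers a
# birthplace question differently depending on the language asked.
fake_answers = {"en": "Ulm", "de": "Ulm", "mk": "Munich"}
def fake_model(lang, question):
    return fake_answers[lang]

questions = {"en": "Where was Einstein born?",
             "de": "Wo wurde Einstein geboren?",
             "mk": "Каде е роден Ајнштајн?"}
answers, consistent = consistency_check(fake_model, questions)
print(consistent)  # → False: the model disagrees with itself
```

Real answers are free text rather than single tokens, so a production harness would need entity linking or fuzzy matching in place of the simple `normalize` step.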
The next topic is Wikimedia's role in training diffusion models and large language models. I think everyone is familiar with the idea that when you create one of these, it ingests a lot of data and learns weights for words and the connections between words. And how do you get those words? You get them by scraping the internet, as well as large datasets like Wikipedia, or, for images, Wikimedia Commons. So the projects represent a pretty significant amount of the training data that's used, and they're also sometimes weighted more highly than other training data. This clearly presents a lot of opportunities, because we are integral to the ecosystem. So maybe we can talk about the opportunities and challenges this may present.

Yeah, from the Wikidata perspective, I think there is a huge opportunity for improving large language models by making more use of the very structured knowledge that Wikidata has. For example, when something new happens in the world, and something interesting, as we all know, happens every day, it very quickly appears on Wikipedia, on Wikidata, and on all of our other projects. It would, in my opinion, be very useful to improve those large language models by taking what Wikidata has as structured data and using it as structured data, instead of trying to learn from a huge amount of text, which takes so long to accumulate for something that just recently happened. For a recent election, for example, it will take time until there's enough content online and the LLMs are retrained to really pick that up in the next training round. So yeah, I think there are a lot of opportunities there to improve them.
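One way to use Wikidata's structured data as suggested above is retrieval-style grounding: fetch current claims and put them in the prompt instead of relying on stale training text. A minimal sketch; in practice the claims would come from Wikidata's API (for instance the `wbgetclaims` module), which is stubbed out here, and "Examplestan" and its leader are invented:

```python
def grounded_prompt(question, claims):
    """Prepend structured facts to a question so the model answers
    from current data instead of stale training text. `claims` maps
    (subject, property) pairs to values; in a real system it would be
    fetched live from Wikidata rather than passed in directly."""
    facts = "\n".join(f"- {subject} {prop}: {value}"
                      for (subject, prop), value in claims.items())
    return (f"Use only these facts from Wikidata:\n{facts}\n\n"
            f"Question: {question}")

# Example: a freshly updated fact the model has never seen in training.
claims = {("Examplestan", "head of government"): "A. Newleader (since 2024)"}
prompt = grounded_prompt("Who governs Examplestan?", claims)
print("A. Newleader" in prompt)  # → True
```

The design point is that the structured claim reaches the model at query time, so nothing has to wait for the next training round.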
There's one thing that I'm particularly afraid of. The modern LLMs all have this feature of catastrophic collapse, which means that if you start training them on output from LLMs, they suddenly become much, much worse. It's kind of an Ouroboros thing, if you're into mythology. And as Sean said, those models are all trained on Wikipedia very heavily; Wikipedia is often weighted with a factor of at least two or three or four in those models, because it's such a good dataset for answering questions and so on. Therefore it's no surprise that these models are really good at spitting out things that look like encyclopedic articles: they've been trained on those specifically. So you might think: oh, actually we don't need Wikipedia anymore, we can get all this out of this system, and the results will look really good and very promising, until the moment you realize that as soon as you start building an ecosystem on that, it will probably collapse catastrophically within a reasonably short time, probably within half a decade or so. But that's enough time, for example, to kill something like Wikipedia. So I'm afraid of a scenario, not really a Skynet scenario, but still, where these AIs come in looking like a decent replacement for Wikipedia, reaching an audience that would surpass ours, cannibalizing all interest in our projects besides a few hardcore people, basically sidelining us completely, only to then collapse, not giving us space to breathe anymore as a project. And I would have no idea how to avoid this kind of scenario, to be honest. This is obviously outside of what we as the Foundation can do, and so on, but it is one thing that I'm really afraid of.

I'll repeat the question. So yeah, I also thought about this scenario already, and my idea is: an AI is never perfect, I would say, but Wikipedia is also not perfect. The way to make sure that Wikipedia can survive in the time of AI is that Wikipedia is always the better source. We still have to improve the quality, the updates, and all those things, so I would say the best way for Wikipedia to survive in a time of AI is to focus on that strategy: how to always be better than the AI.

Yes, thank you. Denny, I think the point you're making is excellent, and I was just wondering: the same way some researchers try to speed up evolution in the laboratory, is there a way we could do this for this catastrophic collapse? Is there a way we could simulate what would happen if we went down that route, if Wikipedia stopped being improved by volunteers, by people, by readers, and we just started training it on itself, and show to the academic community that this might not be the best path to go down?

I mean, it's known that this happens already, so there is a way to actually show it. I'm just wondering whether a Silicon Valley startup that creates something like an AIpedia would ever care about that, and whether the wider audience would care about using it or not, if it feels better in the moment. So even if we show it, even if we prove it conclusively, which wouldn't happen, but even if we give very strong indications, I'm worried that it wouldn't be sufficient to actually avoid the scenario.

Oh, by the way, when you say something, please introduce yourself quickly.

I'm Kevin Goldie, username Vicollo. I'm mainly writing in the German Wikipedia. Okay, that's it.

I'm Lodovic, Dutch Wikipedia, and currently in California. I think there's so much here. There are actually two collapses, because there's both the collapse of the quality of the models themselves, which is a problem for companies like OpenAI, where if you poison the database by training it on AI-created language, the quality goes down; and then there's the collapse, of course, of Wikipedia and the projects, as people move toward these models rather than toward contributing. And I think the academic community is already pretty clear on the idea that yes, these models collapse very, very quickly when trained on AI content. So I think the companies that make these models, the for-profit companies at least, are paying attention to this pretty heavily, and this has come up in tentative conversations that the partnerships teams have had with OpenAI. They recognize this is a problem; whether or not they actually do anything to support the projects to fix that problem is unclear.

Hi, Thomas Shafee, username Evolution and evolvability, from Australia. One of the things with the problem you mentioned, that people could start using AI-generated information by default, whether it's encyclopedic-style information or Google-search-style information, instead of going to Wikipedia: to an extent, reputation is a huge factor in that. It's not wildly different from the idea that anyone can fork Wikipedia or mirror Wikipedia; those typically have way less viewership than Wikipedia itself, partly because Wikipedia has the reputation it has, for better or worse, and it's an improving reputation. So perhaps the solution may not be fully a technological one; part of it has to be the trust, the branding, the public perception, the way that Wikipedia presents itself across different formats.

Okay, my name is Kiril Simeonovski. I am from the Macedonian Wikipedia, but I'm also active on Wikidata, Wikimedia Commons, and the English Wikipedia. I would like to give a different perspective on this. I admire the work that
you're doing, and I think the future is in AI, but the thing with the Wikimedia community is that we need to use it in a way that helps with the work that volunteers don't want to do. For example, I would prioritize using AI and machine learning techniques to wikify pages, to improve quality in the sense of adding hyperlinks or formatting the text, doing the stuff that volunteers usually don't want to do and that newbies have a hard time with. As far as generating articles with AI is concerned, I think this is something that volunteers running bots used to do even 15 years ago. For example, you could run a bot to create articles on celestial bodies or villages, and if you wrote a good script, even the content of those articles would be sufficient beyond a placeholder. I think that at this stage, because we know that contributing as a volunteer is always a joy, and most of the people willing to contribute to the Wikimedia projects do it because it is fun for them and they see it as an opportunity to socialize and to learn something new, the main priority should be to use AI smartly, in a pragmatic way, so that people don't need to spend time on things they find hard or not enjoyable. Thank you.

Yeah, sure. I want to share two things, one an immediate follow-up on what you shared. I think your perspective is shared, at least in good parts of the Wikimedia Foundation. The priority of attention is not article creation, at least not on our end; if the communities decide to go and do that, that's fine, and the communities can choose to do that. But for us there's a lot of focus on existing communities and how we can support them with this piece of technology, as well as on future audiences that we may leave out if we don't think about them. I'll talk about this on Saturday: we now know that in a project like the French Wikipedia, 80% of the edits that happen are maintenance-related. That is a significant amount of edits happening in a relatively established project, and the question we face is: how do we use this piece of technology, plus other pieces of technology that exist, to best support the editors who are doing that work on the project? Because that is key.

The other thing I want to say is on the topic of risk. Yes, there is risk, and we always geek out about it and talk about it. Maryana, in the opening session yesterday, talked about what the world needs from us and from Wikimedia, and I will say, in the context of AI and this particular conversation, that we all have a place to say what we need from the world in order to be able to do the things that we are doing on the projects. If the ecosystem around us is shrinking, if a lot of projects are deciding not to share their data openly, that is bad news for the ecosystem we are in, and we as a community need to align, need to have clear demands. The Foundation is one of the entities that, through partnerships or other efforts, can support the communities in this way, but there are also other entities within the movement. I think it is important for us not to feel that this is happening to us: we have a say, we are shaping the knowledge in the world, and we may have demands.

One perspective from the legal department that, although not directly about AI, relates to what you're saying: during the fiscal year, when we were setting our goals, one of the most important goals this year has been to help existing editors work more efficiently, so that they have more time to do the things they actually want to do and spend less time on maintenance, content moderation, and things like that. So a lot of different avenues are being taken, including thinking about developing products and changing certain new-user-experience things to
make it easier for people who are already doing a lot of work on the projects.

I'm Heather Ford, from Australia. I was wondering about the legal aspect, actually. Going back to Denny's core challenge, which I guess is the one we most have to worry about: are there any plans at the Foundation to enforce the license in any way? And, I know that you are doing a lot of work around AI regulation, but are there plans to make some demands of the companies that are using the content, sometimes explicitly against the licenses?

There are a lot of aspects to that question, and I'd hope to hear from Kat from Creative Commons as well, but first, to answer your question directly. Number one, there's a really strong tension between the goals of the licenses, which are to broadly disseminate free information with very limited attribution requirements, together with the share-alike requirement, which makes sure it can be shared onward. I think the legal department's current perspective is that the licenses do allow for broad uses, including innovative uses like training models. To the extent that there is a tension, it's a very strong one, because the license does require attribution, and many of these companies, particularly proprietary companies, don't even talk about what's in their training data, which is as far from attribution as you can get. Some of the ones that do talk about what's in it don't talk about the specifics, and they don't talk about who actually created it in any robust way. You can imagine that even when a company like OpenAI says "we use Wikipedia data", that's not quite enough, especially because if DALL-E is, for example, trained on Wikimedia Commons, it's each individual contributor, the author, who needs to be attributed in a certain way. And I think this shows the broader tension that people have with these models: to the extent that people wanted to participate in openness, they at least expected, in exchange, to be rewarded with attribution, and for it to contribute more broadly to everyone getting this knowledge rather than it being locked down.

I think the final tension here, which is really important, is that although the output of many of these LLMs is in the public domain under the default copyright laws of many jurisdictions, companies can contractually prevent these things from going into the public domain by saying, for example, in their terms of use: hey, if you use our thing, then here are these extra conditions on it. So again, there's a tension between how they came up with the weights, how they trained the model, whether they disclosed it, and finally how they actually treat the output. Getty Images, for example, is a company that has explicitly said: we would like to stop you from doing this, and if not, we'd like to get paid. That's a perspective we are almost certainly not going to take. Reddit is another case, where they've decided: we will shut down your ability to access information unless you pay. Again, not a path that anyone in the legal department is interested in going down. However, the question of how the industry as a whole conforms with the basic attribution and share-alike requirements is going to be really important to us in the future, and I think we can be a leader there, because we're so integrated into all of these systems. I don't know if Kat has anything to add.

Sure. Hi, I'm Kat Walsh, I'm the general counsel at Creative Commons, and as you might guess, we've had a lot of conversations with people at Wikimedia, at OpenAI, at Stability AI, and with basically everybody interacting with CC-licensed material in this space. One of the things that we struggle with at CC is that the licenses only apply
when you do something that you're required to seek permission for under copyright, and that's making a copy, distributing a copy. But what these AI systems are doing is often not making and distributing a copy. Distributing the training data sets, sure, because that's distributing the material. But training on that data: is that making a copy? We're really hesitant to take that position, because if you take that position, the implications are: if you as a human read and study something and you, you know, write a Wikipedia article from it, have you made a copy? Do you need permission from that copyright holder? If you're doing something that involves a computer, like text and data mining, and you do research based on that, do you need permission from the copyright holder? Have you made a copy? We've got a lot of blog posts at CC, and if you want to hear more from me about what CC is thinking, I'm giving a talk tomorrow, and you can ask me all the spicy questions you would like. But in general, we have been hesitant to say that you need it, and this is totally at odds with how people think about what should deserve attribution, like when you should cite your sources. I think that the copyright aspect and the what-should-you-do aspect can be completely separated: people can cite their sources and drive traffic back to the source communities as well as they can, and whether that's connected to copyright licensing in any way, it doesn't have to be. But just the ways of doing that, and how you should do it, are an open question right now.

Okay, I have a question, specifically for Leila. There is a myth that AI is independent from human interactions, which is not true. So we need to make sure that our active editors, our community, are interacting in the right way with the AI that we have on the projects. So did the research team conduct any online experiments in which community members were tested, and their propensity
to work in an AI-supported environment was tested in some way?

I'm not aware of research, definitely not in our team, in the way that you're proposing. However, I'm willing to talk with you outside of this room, maybe at lunch or afterwards, and then I can understand better the question you're trying to get at, and we can see if we can create something out of it.

Hello, I'm Tilman, my username is HaeB, here on Wikipedia. I really want to push back a bit on this assertion that Wikipedia is so essential to all the LLMs and that they would collapse without it within a few years. It's right that if you look at existing models like GPT-3, we are three or five percent, and often it's weighted more heavily as one of the most reliable data sets. The Foundation actually published a post, I think last month, linking to an analysis which looked at the fine print; it had Wikipedia as the second most widely used source, at 0.19 percent. There was also the New York Times article, which was great overall and which I think many have seen, and it said something like: without Wikipedia there would be no generative AI. I had a conversation after this with the researcher who says this, Nick Vincent, who actually did some great studies on how Google relies on us, and in the end he clarified that he really didn't mean it would be impossible to have current LLMs without Wikipedia. It's more like we were there, we were available, right? We put our dumps out there; it's easy to download. So in some sense Wikipedia was actually a convenience sample. And I do get why we're highlighting this point, right? I mean, if I were doing PR for the Foundation, which I did several years ago, I would highlight this: hey, look, all these models are using us. But among ourselves we really need to be clear that it is not a given that we're essential. And the last thing I'll say: there's
some interesting research on this, but basically nobody knows: if you really wanted to find out how much worse the model would be without Wikipedia, it's really hard to do. There's an interesting overview article from a workshop at a large machine learning conference last year, and they say it would be very expensive to find this out. I mean, the Foundation really could fund such an experiment, but before that we need to be clear among ourselves, at least, that it's not a given. And there's a lot of high-quality data, like textbooks and other sources, that are not in the training data sets and that are at the same time more reliable than us. So I really warn about patting ourselves on the back too much and being too confident about this.

One thing: I can't necessarily speak from a technical research perspective as to how much or how little we matter in making these models work correctly. But what I can say, from a legal perspective, is that the question of how training works, and which countries training can happen in because of copyright laws and other database-type laws, is extremely real to all of these large companies trying to monetize this. And so the fact that we are not only a large database but also a largely free database is critical. For example, you might have a situation where lots of textbooks exist, but those textbooks are owned by publishers, and in a race between publishers trying to sue these companies out of existence and the companies trying to continue using their products, I think there will be a move towards more freely licensed material being used for training, which again facilitates this idea that we are integral to the process. But from a technical perspective, I don't know.

I want to add two things as a follow-up on Tilman's point. One is that I think it is important for us to continue to create spaces where people can call
out what they see as potential risks. This is a space that is rapidly evolving; it's a disruptive piece of technology, or series of technologies, and it's important for us to have spaces in which, without fear, we can talk about the risks that each of us in this room may see in these kinds of technologies, and just hear them, process them. Maybe it is a risk, maybe it is not, but let's have the space for those types of conversations and hear from one another, because the space is moving really fast in terms of advancement. So we need to continue listening to each other, and of course not be afraid of the risks. And again, let me center us back: I invite you to come back to the fact that, as Wikimedians, over more than two decades, you have built basically a radical system for knowledge governance around the world. That's a big deal. That's not, you know, one edit on one page; it's a model, and that model has been evolving for more than two decades, and I think it will continue to evolve in the face of this technology.

Yeah, so currently the AI models are mainly trained on data from Wikipedia, but I think in the future, when it gets broader, when there's more competition, they will also use data directly from scientific papers and so on. So I'm not sure how long they will need Wikipedia in the end, and I see a high danger that the whole of Wikipedia could collapse due to AI. So maybe my question is, because I also see it as a risk that one day a commercial company, maybe from Elon Musk, could have all the control over the knowledge of the world: are there also thoughts in the Foundation about creating our own AI, with all the knowledge, to create it from our side already?

Just to one of your premises before getting to the actual question: training an LLM on scientific publications
directly, I think that Elsevier and Springer have a rather different view on copyright, and on how much you can use, than the Foundation does, so I'm not sure that would go very smoothly, to be honest. But it would be very interesting to watch.

Yeah, just quickly, from the legal department's perspective: I think that Denny is accurate. Yes, there are scientific papers that are absolutely locked down, and these publishers want to charge large amounts of money, particularly for commercial uses like this that disintermediate them. On the other hand, there are many openly licensed scientific articles, and more every day; we actually support efforts to openly license as many scientific articles as possible. And so to the extent that there might be certain use cases that do completely disintermediate certain aspects of Wikipedia, I think that's real. From a general perspective, Denny's point is also accurate in that, to truly generalize the amount of scientific information in the database, for example, you would need licenses from possibly hundreds of different copyright owners. But it would be very entertaining to watch a lawsuit between OpenAI, Microsoft, and Elsevier; that's at least as entertaining as a cage match between Musk and Zuckerberg. But maybe to the point of what ideas have actually been thrown around about creating our own models?

Can I? So, I just want to follow up on this. This is one of the places where no decision has been made. There are discussions, and my team is involved in at least parts of those discussions, around whether we should build a multi-purpose LLM: basically, not for very specific purposes, but using this technology to build general-purpose systems for Wikipedia. Again, there is no decision; there are discussions. And there are considerations here that I want to share. One is the issue of languages. We cannot leave languages behind, and we cannot further widen the
gap with the actions that we take, at least not in significant ways; it has to be in coordination with the communities, and that is work that is happening on the ground. As was mentioned before, these technologies are primarily targeting a few top languages right now, and the question is: what about the other 300-plus languages that we currently care about, and the other 2,000 or so languages that we want to welcome to our projects? We need to think about that, and we need to be intentional about it. It can be that we decide that for some languages, for some projects, we will offer certain types of technologies and for some we won't, or we decide we want to do it for all. But this has to be an intentional choice, and we need to look at the trade-offs. The other thing is that we're experimenting with utilizing LLMs in specific places. For example, we now have models that we're testing for predicting the probability that a revision gets reverted, using an LLM, and we're seeing that these models are quite expensive, quite expensive to develop and maintain. So there is a question of how much we can utilize this; the budget of the Foundation is limited, as you all know. We also think about opportunities: there are partners out there. Hugging Face is a partner; they are in this ecosystem, they're doing a tremendous amount of work in the open data, open science world, and we need to be opportunistic and look at the partnerships and relationships that we can build, and see what the right thing to do is. But I will say that our approach so far has been: we need to be intentional; we can't just jump in and do it.

Hi, I'm Tamzin Dr.
Thanid, from New Zealand. I just wanted to point out that my concern about risk is that there will be effects on the wider ecosystem around open licensing, and there already are, because I see it at institutions that openly license things. I work on the New Zealand thesis project, and things that were previously available you now can't copy; that's just happened in the last few months, and I see it as a direct result of these kinds of large language models, and of people not wanting to contribute to that. I think we're going to see more institutional changes like that, which might actually restrict our access to things that may remain openly licensed but are somehow much less accessible to us.

One thing that I will react to: I think there's going to be a push towards fewer open licenses and more licenses with certain caveats. There are already open licenses with caveats, like RAIL licenses, if you're familiar with the term. And so I think more and more people might say: yes, it's open, but no AI; yes, it's open, but no commercial use; yes, it's open, but I want my name.

I'm Andrew, my username is OhanaUnited. I edit on English and Chinese Wikipedia, and I kind of echo what you said about not forgetting the other languages, because most of the existing AIs today are trained using English data sets. So whenever you try to ask a question in, say, Spanish, Chinese, or French, you would often get a really bad answer, or sometimes even incoherent answers. So I was wondering, or maybe just commenting: perhaps for English the horse has already left the barn, but for the other languages maybe we should engage with those companies. Because we have seen, for example, Tencent and Alibaba, those really big Chinese companies, throw billions and billions of dollars at this, and it turns out that when you ask questions of the Chinese engines using English, you get a better result than asking them
in Chinese. And that's definitely true for other languages, which do not have as much capital to invest in this kind of language training. So I was hoping that the Wikimedia Foundation could embrace this opportunity to welcome those other languages that have not been as well supported online and in AI.

I think the only thing I'll add to what you shared is that we haven't talked here about what happens on the talk pages if machines enter the talk pages. As we are thinking about adding content in more languages, we should also keep in mind the challenges of conversations on the talk pages, of a machine entering that conversation, and of how the community wants to respond. This can be an excellent segue to the next session.

One observation that I'd like to add, in terms of working with companies like Tencent and such: we're often limited by who is, first of all, interested in even talking to us, but moreover, interested in talking to us on the terms that we think are important. A lot of companies are very, very interested in talking to us about how do we make money with training data, how do we make money with these things, but they're less interested in talking about the things that we want to talk about, which is how do we preserve an open ecosystem for free information. That's not necessarily entirely at odds with what they want, but many times it seems to be.

Hi, how are you? Very interesting talk. We've been talking about fears that services like ChatGPT could replace Wikipedia, when we talk about new audiences, for example. But I am curious whether there is some research on whether that is already happening, because for other platforms, for example Stack Overflow and some others, there are a lot of reports that this is already happening. So I am curious whether the Foundation is measuring this, researching this, or whether there is some data today.

Sorry, I'm behind you. So, in terms of research,
it's, I would say, going slowly right now. However, we are intending to invest some effort in understanding how people are engaging with the experimental technologies that the Foundation is developing, for example the ChatGPT plugin for Wikipedia. So we are going to spend more effort in this space. If you have ideas about particular things that could be interesting to you or your community, please come and talk with me, or with Denny and Lydia as well; we would be happy to talk.

Yeah, we have time for one last question, but we are all here afterwards, and we'll be here all weekend to talk about these things. There are also at least two more panels on related topics, one of which is happening right after this, at 11:15 Singapore time, and then a specific talk about languages and how they relate to models.

Okay, I'm from Thailand, and it's the Thai Wikipedia that you mentioned, a language that has already been left out. ChatGPT is not very good in Thai; it's really bad right now, and I think Bard or Bing are still bad for Thai too. In the near future, I think English gets better and better, like Spanish, but Thai is still not good enough, so probably in the future we are left out again. That's what I wanted to share.

We're here till they kick us out.

Very quickly: hi, my name is Pilar Sainz, I'm from Wikimedia Colombia, so I'm very glad to be here. But I was thinking about where the places are where these regulatory discussions are opening, because some of us are trying to participate, for example in WIPO, but it's impossible for many of the people from the movement, particularly in those discussions around open licenses, for example, or the use of them. So where are these spaces, and how can we try to participate as a community?

That's a really good question. It's probably too much for the time we have left, but the two things
that I will say are these. Number one, through the normal kind of regulatory process: the EU has the AI Act, so those are the types of things that people who are EU citizens, for example, can engage with. The second thing is engaging on a private, grassroots level with these companies, because from our limited conversations with them, they are very, very interested in how they look to all of us who are the users of their products and the people who see their brands. So if someone has a complaint about these products not working in a certain language, or about these products using certain types of training data that they believe are fundamentally unethical, say something about it on social media. I think they really do pay attention to that stuff. I know it's sort of a not-great answer, and we could talk about the details, but I think those are two important things.

All right, and so thank you, everybody; that's time. Thanks, really appreciate everybody coming. There's more AI talk in the future, the two other panels, and AI will be part of your future soon enough.