Are you all here? Yes, thanks, Emily, for starting the recording. Christina, if you can share the doc. Everyone, please enter your questions in the chat or in the Google Doc that Christina has shared. And now we'll start with the second set of papers. The first paper is titled Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions, and the authors are Tiziano Piccardi, Martin Gerlach, and Robert West.

OK, can you see my screen? Yes, we can see you. We can hear you, Tiziano. OK, perfect. So hi, everyone. I'm going to present to you a characterization of Wikipedia rabbit hole behavior. I'm Tiziano Piccardi, a PhD student at EPFL, and this work is the result of a collaboration with Martin Gerlach of the Wikimedia Foundation and my advisor, Bob West.

So imagine you hear about a TV series called The Last Kingdom. You check on Wikipedia what it is about, and you see that it's about Alfred the Great. You're curious, and you just start to follow the links, discovering that he fought the Vikings, the Vikings who created Vinland, a settlement in North America where they used to make wine from berries, and so on, going to fermentation, the Neolithic, ancient Egypt. You end up reading about the Ptolemaic dynasty and the role of pharaohs. What just happened is called falling into a wiki rabbit hole: being dragged by Wikipedia into a long session where you get lost and learn about a diverse set of topics. The name is, of course, a reference to the novel Alice's Adventures in Wonderland. This is a behavior well known in popular culture, but what we know is based only on anecdotal reports.

In this study, we characterize these wiki rabbit holes in a data-driven way, by investigating the digital traces left on the server by the readers. To support our analysis, we collected one month of server logs of the entire Wikipedia in English. In total, we collected more than 6.5 billion page load events that we carefully anonymized. And thanks to the HTTP referrer field, we transformed these logs into navigation sessions that connect sequential clicks of the same user. But how do we recognize rabbit hole sessions? There are multiple possibilities, but in our analysis we consider a session a fall into a rabbit hole when its depth is at least 10 articles. By applying this rule, we retain around 0.24% of all the original sessions, and as the title of the paper suggests, we are exploring the long tail of the navigation sessions.

Let's now look at some of the properties of these long sessions. First, by looking at the most common entry pages, we notice that people often start longer explorations from articles about elections, television shows, and historical dynasties. All these articles have one thing in common: the infobox has navigational links, such as predecessor or successor, that are used by readers to explore all the articles of a series. A second interesting property is that when we look at the temporal dynamics, we notice differences between day and night. In this plot, we have the fraction of rabbit hole sessions by time of the day, split by device and by weekday versus weekend. We notice that the fraction of rabbit hole sessions is higher during the night, with an increase of almost two times on mobile.

But what topics are associated with rabbit hole sessions? To answer this question, we use regression analysis. First, we assign to the first article of the session the topic vector obtained from ORES, the official Wikipedia topic classifier.
And then we assign the positive label to sessions with a depth of at least 10 articles, and the negative one to all the others. By fitting a logistic regression, we obtain the topics that are most associated with falling into a wiki rabbit hole. The coefficients show that topics such as libraries, history, and entertainment drive readers toward longer sessions, while articles about STEM and medicine are more associated with brief interactions with Wikipedia.

Finally, the next natural question is how these sessions evolve beyond the first page. For example, do people move semantically far from the first article when they navigate the content? To answer this question, we first projected all articles into a topic space obtained from ORES; we basically use the topic vector as the article's position in this space. Then, for each article starting a session, like the gray triangle in this case, we look at the evolution of the trajectory in this space. To have a null model to compare with human navigation, we created a random walker that navigates Wikipedia by picking a random link available on the page, and for each trajectory we run the random walker starting from the same document and for the same number of steps. Finally, we compute the mean squared displacement, which is used in physics to measure the dispersion of particles from the starting position; it is basically the average of the squared distances. With this metric, we can plot the diffusion of the random trajectories in the semantic space. Each line in this plot is the average displacement of the trajectories of a given length. When we add the human-generated trajectories, we observe an important property: they don't converge to a random location. This is important because it means that readers stay in the semantic neighborhood of the first page even for long sessions. So on average, in a Wikipedia rabbit hole, readers get lost, but within a consistent set of topics.

In conclusion, we learned that Wikipedia rabbit hole sessions are affected by the structure of the page, are more frequent at night, start more frequently from articles about entertainment, and on average don't lead readers to a completely random page. Thank you for your attention. I'm happy to answer your questions.

Awesome, there's time for one question. Thank you, Tiziano. There are many great questions, so I'm sure you will answer them later. To pick one: in this work and in this presentation you focus on the long sequences, but how might navigation in such rabbit hole sessions compare to the navigation paths and strategies one might see either in games and artificial scenarios, or in real navigation in the wild? So this work is a follow-up of another paper where we investigated navigation more broadly, and we also compared with the sessions generated in games. There is an important difference: in games there is a clear definition of success, so you know exactly where the user was going, and you can study how the user navigates the network to reach exactly that goal. Natural navigation is very different. You don't see this pattern of going through a hub, meaning a generic page that you visit only because you are passing through a central node to find the destination page, and sessions are on average a lot shorter. For example, 78% of the sessions are composed of a single page load, so it's a quick check and maybe a little bit more. And then the exploration is different.
Readers tend to go to the periphery of the network: they enter at very popular pages from a search, and then they diverge toward the periphery. There are many other differences, but there is a full paper about that. Awesome, thank you, Tiziano.

We'll move on to the next presentation, by Nicole Schwitter. The title of the talk is Offline Meetups of German Wikipedians: Boosting or Breaking Activity?

Thank you for the introduction and the nice presentation before. So I will dive right in. What I'm presenting today is a part of my PhD thesis that is still a work in progress. In my PhD I research Wikipedia, and I particularly focus on the people behind it, on the people that collaborate together to write this online encyclopedia. Now, I'm less focused on the online component; I focus more on the other side, the offline side. So whenever Wikipedians go out into the real world, put on their nice t-shirts, and meet face to face to put a face behind the username. Those informal meetups come in all shapes and sizes, from meeting up in a pub and drinking a beer together, to hiking trips or organized barbecues, or more work-related gatherings such as open editing events. Those offline meetups are what I'm interested in in my PhD, and I'm looking at how offline meetups influence online behavior. Today I'm focusing on one specific domain of online behavior: how offline meetups influence online contributing behavior, so editing on Wikipedia.

To answer that question, I look at the German Wikipedia from 2001 up to 2020. On one hand, I have contributing behavior, which I take from the data dumps, where I look at the metadata, so I know who edited what and when. On the other hand, I have meeting data. Now, meeting data is not process-generated, so it comes less structured, but it's still available because meetings are organized on the platform. So my goal was to scrape all organizational pages. Those pages look like the screenshot on the right: you have a list of attendees, a list of apologies, and also recorded minutes. My goal was to scrape all meetups organized on the German Wikipedia in that timeframe. I ended up with over 4,400 meetings that took place in the timeframe, most of them, 99%, being organized in the German-speaking area, but 1% did take place globally, in 20 different countries. So much for the descriptives.

Now, my question for today is to identify the causal effect of meetings. I'm interested in whether those attendees with their t-shirts edit any differently than a comparable group that did not take part in meetings. So what I want to do is compare the meetup attendees, the treatment group on the right, to a control group of non-attendees. For each of my attendees, I selected one similar twin, similarity being defined as being similar in tenure and past activity, and then I can compare them. So my control group on the left and my treatment group on the right are similar up to the point of the meetup, and now I want to compare their behavior after the meetup. So I have a quasi-experimental setup and I can use a difference-in-differences design. That means I want to compare the changes before and after the meetup across the actual attendees, the treatment group, and the corresponding twins, the control group. In this presentation, I look at the long-term change, so one year before the meetup versus one year after the meetup.
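To make the matching and difference-in-differences logic concrete, here is a minimal sketch in Python on made-up data. The column names, numbers, and the crude nearest-neighbour matching are illustrative assumptions, not the author's actual code or variables:

    import numpy as np
    import pandas as pd

    # Toy editor-level data; all names and numbers are invented for illustration.
    rng = np.random.default_rng(0)
    n = 200
    editors = pd.DataFrame({
        "attended": rng.integers(0, 2, n).astype(bool),   # went to a meetup?
        "tenure_days": rng.integers(30, 3000, n),
        "edits_pre": rng.poisson(40, n),                  # edits in the year before
    })
    editors["edits_post"] = rng.poisson(
        np.maximum(editors["edits_pre"] - 12 + 8 * editors["attended"], 1))

    # 1) For each attendee, pick the most similar non-attendee ("twin")
    #    by tenure and past activity (nearest neighbour on z-scores).
    feats = editors[["tenure_days", "edits_pre"]]
    z = (feats - feats.mean()) / feats.std()
    treated = editors[editors["attended"]]
    pool_idx = editors.index[~editors["attended"]]
    twins = [((z.loc[pool_idx] - z.loc[i]) ** 2).sum(axis=1).idxmin()
             for i in treated.index]
    control = editors.loc[twins]

    # 2) Difference-in-differences: change for attendees minus change for twins.
    did = ((treated["edits_post"] - treated["edits_pre"]).mean()
           - (control["edits_post"] - control["edits_pre"]).mean())
    print(f"Estimated meetup effect: {did:+.1f} edits in the year after")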
In the paper, I also look at shorter-term changes, and I break the process into two separate parts. First, I look at users who have not made an edit in the year before the meetup and check whether they make an edit after the meetup, yes or no. Second, I look at users who have made an edit in the year before the meetup and look at the change in their number of edits. So I have, on the one hand, whether people who did not edit before edit after the meetup, and on the other hand, to what extent people changed their editing behavior.

On the left, I look at the binary decision, so people who did not edit before. The intercept is the baseline probability, and what I find is that 6% of people that did not make an edit in the year before do make an edit in the year after. So there's a 6% probability of starting to edit in the year after if you didn't edit before. However, if you're in the treatment group, meaning you actually attended a meetup, your probability is 25 percentage points higher. So if you actually attended a meetup, your likelihood of starting to edit is at 31%. If we look at the extent, we find a negative intercept. What that means is that people make on average 12 fewer edits in the year after than in the year before a meetup, at least if they are in the control group. If they are in the treatment group, meaning they actually did attend the meetup, they only make about four fewer edits. So there's a negative trend, but it's much less pronounced in the treatment group than in the control group.

To summarize, what I find is that meetups have a positive effect on Wikipedia. Users who attend a meetup are much more likely to start contributing again after the meetup if they have not been editing articles before. And while it is not the case that users increase their contributions after a meetup in comparison to before, the reduction in contributions is smaller than the reduction in a comparable control group. If you want more details about the methods, more detailed analyses, and directions for future research, please also look at the paper and come to the poster session. That's it, and I'm looking forward to questions. Thank you.

Thank you very much, Nicole, for the good presentation. One question is how did you, oops, I missed it. Do you also have any data outside of the German Wikipedia? And then, thinking about next steps in your research, what are the next steps that you're planning? And if you could run active experiments and community campaigns, what would be ideal experiments that you could envision? Thank you. So many questions. To start with the first one, about language versions: I'm only focusing on the German one; scraping all meetups took me about one year. There are just many meetups and many pages to read, so it does take a lot of time, especially for larger language versions. I did look at the Alemannic Wikipedia as well, but I didn't really analyze it and used it mostly as a toy example. I did forget the other questions. It was just about the next steps. Ah, next steps. So yeah, in my PhD in general, I'm also looking at how offline meetups influence other online behavior, such as elections and norms or reverts, and I'm currently working on that. Okay, thank you. There are a lot of questions for you, Nicole, which I'm sure people will ask you in your poster session. Thanks, Christina. We'll move on to the third and final full talk in this session.
And the title of the paper is The Role of Online Attention in the Supply of Disinformation in Wikipedia. The authors are Anis Elebiary and Giovanni Luca Ciampaglia. So over to you, Anis. Can everyone hear me? Yes, we can hear you, we can see you. Okay, awesome.

So my name is Anis and I'm a PhD student at the University of South Florida, and today I'm presenting our paper on online attention and disinformation in Wikipedia. I have worked on this research alongside my supervisor, Dr. Giovanni Ciampaglia. There exist many potential threats to Wikipedia's knowledge integrity. One of these threats is the creation of hoax articles, which are fake or fictitious entries that were deliberately created. For example, this is a hoax article that was created about a fake Australian god and lasted around 10 years without getting caught. Vandalism is another common threat to Wikipedia's knowledge integrity. However, hoaxes are different from vandalism in the sense that vandalism defaces existing articles, while hoaxes are completely new entries. Vandalism attacks can take the form of sexual insults, humor, or page blanking, as shown by this example. In research conducted by Dr. Kumar and colleagues, among them Dr. Robert West, they showed that 92% of hoaxes get detected within the first day, with one in 100 hoaxes surviving for more than a year.

People's behavior online is influenced by both endogenous and exogenous factors, and these factors in turn shape how we produce and consume information on the web. In a study conducted previously by Dr. Ciampaglia and colleagues, they studied the creation of non-hoax articles on Wikipedia, and they showed that there is a sudden spike of attention right after the creation of the Hurricane Sandy entry on Wikipedia, which means there was a need, a demand, for an article to be created about that topic. They then went on to show that the demand drives the supply of information for non-hoax articles. The unresolved question is what drives the creation of hoaxes on Wikipedia. To get a step closer to answering that question, we tried to see whether online attention, in the form of Wikipedia traffic toward a topic, increases the likelihood of disinformation, in the form of hoax articles, being created about it.

To do so, we collected a set of known hoaxes: 190 hoax articles that are kept on a page maintained by Wikipedia moderators. These hoaxes are successful in the sense that they evaded detection for more than one month or were discussed by media sources. This plot shows how many hoaxes were created each year. As we can see, the majority of hoaxes in that set were created between 2005 and 2007, and that is parallel with Wikipedia's known peak of activity during that time. In 2008, we see a decline, and that's because the NPP process, Wikipedia's new page patrolling process, started in November 2007. We've mentioned the word topic before; in our research we define the topic of a hoax as its outlinks, which are the pages that are linked within the hoax article. And we study the traffic for that topic over a 14-day observation window centered around the creation day: seven days before creation and seven days after creation. For each hoax we calculate the relative volume change, delta v over v_b, where v_b is the topic's median traffic in the seven days before creation and v_a is the topic's median traffic in the seven days after.
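As a concrete reading of that definition, here is a small Python sketch. The sign convention (positive meaning more attention before creation) and the exact window boundaries are assumptions taken from how the result is described in the talk, not from the authors' code:

    import numpy as np

    def relative_volume_change(daily_views, creation_day, window=7):
        # v_b: median daily traffic in the `window` days before creation.
        # v_a: median daily traffic in the `window` days after creation.
        views = np.asarray(daily_views, dtype=float)
        v_b = np.median(views[creation_day - window:creation_day])
        v_a = np.median(views[creation_day + 1:creation_day + 1 + window])
        # Assumed sign convention: positive means more attention before creation.
        return (v_b - v_a) / v_b

    # Toy series of daily views to a hoax's outlinks; the hoax is created on day 7.
    views = [5, 6, 9, 14, 20, 22, 25, 18, 12, 9, 8, 7, 6, 6, 5]
    print(round(relative_volume_change(views, creation_day=7), 3))  # 0.5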
And if that value is positive, if the relative volume change is positive, that means that the hoax accumulates more attention before creation than after. However, to check that this pattern is not simply common to all new articles, we have to establish a baseline to compare our relative volume change to. So we collect what we call a cohort, defined as all the non-hoax articles that were created on the same day as each hoax. We have 190 hoaxes and 190 cohorts, one for each hoax. This cohort was collected after resolving redirects, and this plot shows that the inclusion of redirects not only inflates the cohort size but can also skew the traffic analysis that we do, which is getting the values of delta v over v_b.

Inspired by the work of Dr. Kumar and colleagues, we studied appearance features to understand how hoaxes differ from cohorts. We're not going to go into detail about all of the appearance features, but we will point out one feature, the wiki-link density, represented by how many outlinks exist within a page per 100 words. This graph simply shows the Z scores of that appearance feature. If the Z score is positive, the hoax article tends to have a higher link density than its cohort, and if the Z score is negative, it's the opposite. We can see in this graph that it's nicely centered around zero, which means that hoaxes and cohorts have similar linking patterns, so we can rule out confounding in our analysis due to different linking patterns. That's why it's an apples-to-apples kind of situation when we compare hoaxes to cohorts.

This graph shows a sample distribution of delta v over v_b for one hoax. The turquoise bars show the distribution for the cohort, the red line shows the value for the hoax, and the black dashed line shows the average of the turquoise distribution. If the value for the hoax, the red line, is to the right of the black line, that means that, compared to its cohort, the hoax tends to accumulate more of its attention before its creation than after. To better understand whether this applies to all of the hoax articles within our dataset, we calculate the difference, which is simply the subtraction of these two lines. If this difference is positive, that affirms the case that attention is accumulated before creation for the hoax. This shows the distribution of the 190 difference values that we got, and the mean in this case is positive, which means that for most of the difference values, hoaxes tend to accumulate more attention before creation. We constructed the 95% confidence interval using bootstrapping, and we can see the sample mean falls within the 95% confidence interval.

To conclude, hoaxes tend to have more traffic accumulated before their creation than after, when compared to their cohorts, and this is consistent with a model in which the supply of false and misleading information is driven by attention. And probably the most important conclusion of all is: do not create hoaxes. If there's only one point you take from this presentation, please do not create hoaxes. If you would like to replicate our plots or regenerate our data, this is the GitHub link. Thank you all, and I'm looking forward to answering your questions.

Thank you, Anis, for the great presentation. One question related to the selection on media coverage.
Could the discussion by media sources introduce a bias, because there might be some very short-lived hoaxes that get picked up by some media? So did you check the characteristics of the hoaxes and the duration of the coverage? Not individually, but all of these hoaxes are successful in the sense that they evaded detection; not all of them are discussed by media sources, but some of them are. We didn't go individually into checking whether each one of these hoaxes was discussed by media sources. However, a future consideration is to expand this beyond successful hoaxes, and to non-English Wikipedia entries as well. Awesome, thanks so much, Anis.

Now we move on to the lightning talks. This is the second round of those. And again, please put your questions for these lightning talks in the Google Doc. We won't have time to take questions immediately afterwards, but you can use them during the poster session. So, Miriam, if you could please start the lightning talks, please.

Hi, my name is Nidia Armandes. I am responsible for data processing at CAICYT-CONICET, a research institution in Buenos Aires, Argentina. In this video, I will present the first steps of a study related to improving Wikipedia's references. Wikipedia allows editors to automatically generate citations from a URL using a citation generator. These automatic citations sometimes contain errors, as you can see on the screen. Our research is part of Web2Cit, a project that is developing a tool for improving the results of the citation generator for URLs. We want to evaluate the performance of the citation generator, comparing the citation that it produces for a URL with the correct form of the citation for that URL. To do this, first we need to find a set of correct citations with URLs and extract the metadata from them. We also have to obtain automatic citations for the same URLs and measure the difference between the correct and the automatic citation. For the first step, finding a set of correct citations, we gathered a corpus of 10,000 featured articles from the English, French, Portuguese, and Spanish Wikipedias. We isolated all the citation templates from the wiki code and then extracted the information for a series of parameters: title, author, source, date, and of course, URL. If you're interested in our findings regarding these first steps, please take a look at our extended abstract for this conference. And if you want to know more about the script and the following steps, visit our Jupyter notebook on PAWS and join the conversation at the Wiki Workshop. Thank you.

In the last few years, major social media platforms like Twitter and Reddit have noted the phenomenon of ban evasion, a ban circumvention strategy that leads to temporally disjoint operation of two accounts. To study online ban evasion, we have curated a dataset of about 8,500 ban evasion pairs on Wikipedia, where each pair comprises a banned malicious parent and a child account that was created to evade the ban and continue malicious activities. We formulate the ban evasion lifecycle and address crucial early prediction, detection, and attribution tasks using machine learning models. Our models demonstrate an impressive ability to predict and detect ban evasion. Additionally, our data-driven analysis shows that there are similarities between parent and child accounts in terms of edited pages and vocabulary used. Interestingly, some ban evaders tend to hide by using fewer swear words and more objective language than their banned parents.
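Purely to illustrate what similarity in edited pages and vocabulary can look like in practice, here is a small Python sketch of two such features; these are generic stand-ins on toy data, not the features or model actually used in the paper:

    from collections import Counter
    import math

    def page_overlap(pages_a, pages_b):
        # Jaccard overlap of the sets of pages the two accounts edited.
        a, b = set(pages_a), set(pages_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def vocab_similarity(text_a, text_b):
        # Cosine similarity of simple bag-of-words counts.
        ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(ca[w] * cb[w] for w in ca)
        norm = (math.sqrt(sum(v * v for v in ca.values()))
                * math.sqrt(sum(v * v for v in cb.values())))
        return dot / norm if norm else 0.0

    # Toy parent/child accounts, echoing the page names mentioned in the demo.
    parent_pages = ["Paul Rose", "Victor Davis", "A Third Article"]
    child_pages = ["Paul Rose", "Victor Davis"]
    print(page_overlap(parent_pages, child_pages))                 # about 0.67
    print(vocab_similarity("he was buried in the local cemetery",
                           "his burial took place in the cemetery"))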
Based on our research, we are working on a tool that would help Wikipedia moderators in evaluating suspected ban evasion. Here's a demo of the tool by Zen and Jio. Let's say we have two users with quite different user names. The model informs us that these two users are a ban evasion pair with a probability of 0.87. We want to understand why our model thinks that way. When we open the metadata dashboard, we can see that even though these two user names don't seem similar, the accounts are actually editing the same Wikipedia pages, such as Paul Rose and Victor Davis. For a closer look, we analyze the most similar sentences added by these two accounts and notice that they both mention two people with the same last name who were born in the same location and were involved in rebelling groups. Our system also captures sentences that discuss burials in cemeteries as similar. It also allows visualizations across other categories like vocabulary overlap, psycholinguistic attributes, and sentiment scores. Thank you, and please stop by our Wiki Workshop poster and full presentation in the main conference.

Hello, everyone. My name is Patrick Keogh. I'm a PhD candidate at SoDa Labs at Monash University. Today I'll be presenting this early-stage joint research project titled Editing the Truth. In this project, we want to understand how and why governments may be interfering in Wikipedia. In particular, we're looking at the capacity of states to disseminate information in Wikipedia, and we're using government edit quality, in particular the ability of government entities to adhere to the strict editing standards of Wikipedia, to create a new measure of bureaucratic professionalism in the digital space, before looking at the determinants of government editing behavior on Wikipedia. We start by creating a data set of 46,000 edits from 702 government entities in 83 countries. We create this data set using a database called DB-IP, which has ownership information for the universe of internet protocol addresses. We create a tool to query each owner in the Google Knowledge Panel and extract an entity description for that owner. We then create a training data set of entity descriptions using the Wikidata taxonomy and classify each owner in the DB-IP database as government or not government. Using our government IP data set, we match this to anonymous Wikipedia edits in all compatible language versions of Wikipedia, and we use vandalism detection tools to measure the good intent of the government editor and whether or not the edit made by the government entity damages the quality of the article. We find that the percentages of higher-quality and lower-quality edits under this measure are a good measure of bureaucratic professionalism; in particular, states with higher education, a greater proportion of female staff, and higher cybersecurity skills make higher-quality edits on average. We then map the geolocation of each Wikipedia article being edited by a government. Here we see a map of edits by United States government entities, and we use this to uncover the determinants of government edits. We look forward to discussing this with you more in the poster session.

Hello everyone, nice to meet you all. My name is Mykola and my colleague is Diego, and today I want to present our work on a semi-automated fact-checking pipeline based on Wikipedia. Even though the full automation of fact checking remains unreachable today, any tool that can support fact checkers with the manual work can be quite useful.
In this work, we concentrate on search for fact checking. We experiment with different manual search strategies and with how the claim's truth label and the article's quality influence the fact checking. We use the FEVER data set and process it for our needs, keeping only the SUPPORTS and REFUTES labeled samples; we also update the article names. We use the rate of found items and the rate of items correctly placed in the first position to evaluate our results. According to our experiments, searching for the raw claim in Wikipedia search gives good results, but enhancing the strategy by extracting the named entities and searching for them shows a great increase in the metrics. Another finding was that using Google search with the raw claim as the query gives comparably good results. Another interesting discovery was that searching for sources to refute incorrect claims can be more complicated than looking for evidence for correct statements, but the strategy with query modification reduced that effect by searching for the mentioned named entities instead of using the whole claim as a query. We also looked at article quality obtained from the WP10 model, and we found that the distribution of articles of different quality differs across positions in the search results. Finally, we decided to build an initial pipeline for ranking results to make the search more specific for fact checking, using the quality information and a learning-to-rank model, which increased recall@1 in our case. Thank you for your attention. I'm looking forward to answering all your questions and speaking with you, and stand with Ukraine.

Hi, I'm Nathan TeBlunthuis, and I'm excited to be here at the Wiki Workshop to present my research investigating how to measure Wikipedia article quality in one dimension by extending ORES with ordinal regression. This is from work that I presented at OpenSym last year. Article quality measurement is important for Wikipedia community members to track knowledge gaps, as well as for academic researchers to study topics like political polarization and collaboration. Now, Wikipedia has WikiProjects, the WikiProjects have members, and the members do quality assessments that are really valuable and allow us to study article quality in a really good way. However, their assessments are limited in that they happen only irregularly in time: articles can change between assessments, and this leads to missing data. As a result, researchers have used machine learning to predict that missing data, the quality levels of articles that haven't recently been assessed. The second limitation of the assessments is that they happen on a discrete scale. This is probably a good thing from the perspective of the people doing the assessments, but for statistical purposes it's a little bit complicated, because we might want to measure more granular levels of quality. Halfaker, and others building on his work, have dealt with that by combining the output of the ORES model into a single score; the ORES model actually outputs a different score for each quality level. This process of combining the scores depends on assuming that the different quality levels are roughly evenly spaced from each other. But I think that assumption is doubtful, because it might take a lot more work to raise a good article to featured status than to raise a stub to start class. And so in this project, I'm relaxing that assumption by using an ordinal regression model to combine the weights instead of just taking their sum.
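As a rough illustration of the idea, not Nathan's actual model or features, here is a sketch in Python using statsmodels' proportional-odds OrderedModel on synthetic data; the learned thresholds play the role of the possibly uneven spacing between quality classes, and the feature names and cutpoints are invented:

    import numpy as np
    import pandas as pd
    from statsmodels.miscmodels.ordinal_model import OrderedModel

    # Synthetic stand-in for article features and WikiProject assessments
    # (Stub < Start < C < B < GA < FA); feature names are invented.
    rng = np.random.default_rng(42)
    n = 1000
    X = pd.DataFrame({"log_length": rng.normal(7, 1.5, n),
                      "n_refs": rng.poisson(15, n)})
    latent = 0.8 * X["log_length"] + 0.05 * X["n_refs"] + rng.logistic(size=n)
    levels = ["Stub", "Start", "C", "B", "GA", "FA"]
    cuts = [5.0, 6.0, 6.8, 7.5, 9.5]  # deliberately uneven spacing
    y = pd.Series(pd.Categorical.from_codes(np.digitize(latent, cuts),
                                            levels, ordered=True))

    # Proportional-odds model: learns the cutpoints between quality classes
    # instead of assuming the classes are evenly spaced.
    model = OrderedModel(y, X, distr="logit")
    result = model.fit(method="bfgs", disp=False)
    print(result.params)  # feature weights followed by threshold parameters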
And doing this provides improved accuracy on realistic research datasets, and we can also infer the spacing between the quality levels. This chart shows, with vertical lines for different datasets, the inferred positions of the different quality levels. Yeah, so that's it. Thank you very much. Looking forward to chatting with you soon.

Hello everyone, I'm Kai Zhu. Today I want to share with you our research in progress: can machine translation narrow the knowledge gap across languages? This is at a very early stage, so any of your comments, suggestions, and feedback will be highly appreciated. In this study, we want to investigate the role of machine translation in content production on Wikipedia and how it could help with the knowledge gap issue. We leverage a natural experiment in which Wikipedia integrated Google Translate into its in-house Content Translation tool in January 2019. Content Translation is a tool developed by the Wikimedia Foundation to help editors create an initial translation draft of an article from another language edition of Wikipedia. So from January 2019, you can use Google Translate to translate articles. We have a set of research questions that we aim to answer. We don't have the answers for all of them yet, but this is the goal. The first question is about how information is exchanged between different language editions of Wikipedia. The second question is about how the behavior of editors changes and how they collaborate with the machine translation algorithm. The third is about the diffusion of locally relevant and culturally specific content. We have some preliminary results. First of all, we see clear trends after the integration of Google Translate in 2019: there's a steady increase in the number of articles created with Content Translation. Okay, so that's all the time I have. If you're interested, please reach out.

Hi everyone, I'm Marc Miquel. I will give you an overview of this paper called Wikipedia, Older or Teen, co-authored with Cristian Consonni and David Laniado. Wikipedia is an undeniably successful project with an unprecedented number of online volunteers. However, researchers observed that the number of active editors of the largest Wikipedia language editions started to decline in 2007. Years after those findings, researchers and community activists still need to understand community growth. In this study, we inspected the temporal evolution of the number of active editors, comparing the trends obtained for different language editions and performing clustering to identify general patterns. We focused on the 50 largest language communities by number of active editors in August 2021. To group communities exhibiting similar temporal patterns, we applied K-means clustering to the time series, and we used dynamic time warping to measure the similarity between the temporal sequences. We obtained these six clusters. Our results suggest that only half of them exhibit a pattern of decline or stagnation, while the others are still growing in the size of their editor community. This represents a significant breakthrough, given that it was widely assumed that communities were all in decline for not being able to maintain their number of active editors, possibly because of a focus on English Wikipedia and other large language editions. To talk more about this topic, come see us at the Wiki Workshop. Thank you very much.
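For readers curious how that clustering step can be set up, here is a minimal sketch using the tslearn library on synthetic monthly active-editor series; the series shapes, the scaling choice, and the two clusters (rather than the six found in the study) are assumptions for illustration only:

    import numpy as np
    from tslearn.clustering import TimeSeriesKMeans
    from tslearn.preprocessing import TimeSeriesScalerMeanVariance

    # Synthetic monthly active-editor counts for 50 "language editions":
    # half peak early and decline, half keep growing.
    rng = np.random.default_rng(7)
    t = np.arange(120)
    declining = [1000 * np.exp(-((t - 40) / 25.0) ** 2) + rng.normal(0, 20, t.size)
                 for _ in range(25)]
    growing = [5 * t + rng.normal(0, 40, t.size) for _ in range(25)]
    series = np.stack(declining + growing)[:, :, None]  # (n_series, length, 1)

    # Normalize each series, then k-means with dynamic time warping as the metric.
    series = TimeSeriesScalerMeanVariance().fit_transform(series)
    km = TimeSeriesKMeans(n_clusters=2, metric="dtw", random_state=0)
    labels = km.fit_predict(series)
    print(labels)  # cluster assignment for each language edition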
This paper, The Digital Gender Disparity, is part of a larger Wikimedia research project titled Mapping Repositories on Gender and Sexuality in Indian Languages at CIS-A2K. This research arises in continuity with earlier research done over the last decade at CIS-A2K on the gender gap in Wikimedia projects. The gender gap and bias, as noted in the existing scholarship, takes two forms: one is the participation gap in the editor composition, the second being the nature of the content. This research focuses on the nature of the content that is produced on gender, sexuality, and feminism in Indian languages. Thereby, this research spans three major themes: knowledge production on gender and sexuality, digital documentation of existing knowledge, and Indian languages. To understand the process of knowledge and content creation on gender, sexuality, and feminism, through this research we have interacted with major stakeholders in the realm, such as writers, translators, educators, and artists, producing content in the capacity of individuals and organizations.

Two major observations made in this research are going to be presented here. We have learned from our respondents that the digital space is no different from the actual world: sociopolitical ramifications that exist in the actual world are also reproduced in the digital space, and they hinder the process of knowledge production on gender, sexuality, and feminism, which also leads to the lack of digital documentation of the existing knowledge on gender, sexuality, and feminism. We have also learned that social media has evolved as a space for the creation of knowledge and content and its dissemination, especially by individuals hailing from socio-economically marginalized sections. The second finding of this research was the importance of locating a critique. We have learned from all our respondents that the knowledge they produce is very critical of how gender operates, and they have been very consciously deploying an intersectional perspective while producing knowledge on gender, sexuality, and feminism. Specific examples of this, as pointed out by our respondents, are the difficulty of translating conceptual vocabulary on gender and sexuality into Indian languages, and the overemphasis on framing the feminist critique from a Western or Anglo-centric perspective. Our respondents also pointed out that the Wikimedia projects should include these critiques and work with a similar intersectional approach while producing knowledge and content on gender, sexuality, and feminism. Thank you.

This work is called Wikimedia and Gender: The Deleted, the Marked, and the Unmarked Biographies, created by Professor Núria Ferran-Ferrer, Professor Julio Meneses, and me, David Ramírez-Ordóñez. The gender bias in Wikimedia presents as a problem of three different kinds: editors, content, and readership. We focus on the gender content bias, specifically in the content creation and deletion process. We are working on the English Wikipedia, on biographies of scientists: the deleted, the marked, and the unmarked biographies. In this diagram, you can see the different types of biographies after the evaluation process made by editors, creating a spectrum: starting from the deleted biographies, then those marked for lacking notability or reliable sources, and finally biographies without any mark, the unmarked biographies.
We propose this methodology for the analysis of two corpora of data: biographies in the Articles for Deletion category, and biographies tagged as needing reliable sources. In this way, we can also cover biographies that are invisible to those who are not administrators. We consider that in order to solve the gender bias within Wikimedia, we need to understand the logic of the evaluation of biographies, regardless of the number of biographies created. If we don't take this into account, then even if more articles are created, the rate of deletion or tagging may still maintain the imbalance, and the gap will continue to persist. Thanks for your attention. Please get in touch with us.

Hello, everyone. My name is Oktie Hassanzadeh. The work I'm presenting today is done as part of a project at IBM Research with the goal of building an AI agent that could help with preparing for the future and helping the world plan for the so-called known unknowns. For the work we're presenting at this workshop, what we are exploring is whether we are able to build an event analysis and forecasting solution using the knowledge expressed in the text of Wikipedia articles about past events and their consequences. I have a very simple example here to show the high-level idea. If we go back to January 8th, 2020, we may be able to map the initial news articles covering the World Health Organization's announcement of a pneumonia outbreak with an unknown cause to relevant knowledge in Wikipedia, for example, the knowledge that the SARS outbreak had similar events at its start. Then we can also look at what the SARS outbreak resulted in, and use that to predict some of the consequences of the new outbreak, for example the effect on the tourism industry and then on oil and gas prices, which also happened for COVID. The question is, will we be able to do this faster than analysts, and at scale? Here's a high-level architecture of the solution we are building using Wikipedia and Wikidata, and in part Wikinews. At the core, there is a causal knowledge graph of events curated from existing event-related knowledge in Wikidata and then augmented by knowledge extraction from Wikipedia articles. In this solution, monitoring ongoing news is done by mapping news headlines to event concepts in the knowledge graph, and the analysis of events is done by looking at past similar events and their causes and consequences as captured in the knowledge graph. As you can see in this example, Wikidata already has knowledge about major events and relations such as has cause or has effect, but of course it's far from complete. At the same time, there are many relations expressed in the text of Wikipedia, even in the first paragraph of an article, and what we do is automatically extract such relations to augment the Wikidata-based knowledge graph. This is done primarily through a number of neural models for language understanding. If you come to our virtual session, I would love to go through some of the lessons learned so far and the work we can do with the wiki community to address some of the challenges we have faced in this project.

Good day, and welcome to the presentation on considerations for a model of noun classes in Niger-Congo B, also known as Bantu, languages in Wikidata. I am Maria Keet from the University of Cape Town, and this is joint work with Langa Khumalo from SADiLaR and Zola Mahlaza from the University of Pretoria. The broader context is about the possibility of Wikipedia pages in the languages spoken in Sub-Saharan Africa.
Abstract Wikipedia might help speed up creating those pages, but it relies on Wikidata for lexicographic data, and Wikidata has very little of that for the NCB languages. The first key step is the noun class system, as it governs the rest of the sentence. Here's a summary table of those noun classes and what kind of things can be found in each of them. For instance, class 1 in the singular pairs with class 2 for its plural, for humans, whereas animals go in noun classes 9 and 10, respectively. These noun classes affect the sentence construction. Consider, for instance, the adjective tall and the verb eats, which remain the same in English whatever entity they apply to; but for the NCB languages, there are concords that depend on the noun class of the noun in order to complete the adjective and the verb, as we can see here on the slide. We set out to collect requirements for a model, based on two premises: linguistic soundness and the ability to bootstrap to other under-resourced languages. Here's a sampling of those 14 requirements. First, it should be a Meinhof system, like the one shown in the previous tables; that is key, since it's the only one that satisfies those two premises. It also includes translation aspects and various optional features. The list of requirements in the abstract may be challenging to implement fully, but we are nonetheless trying to design a comprehensive model that is extensible. A first concrete action for Wikidata would be to use the Meinhof system to cover the very basics, which then also aligns with existing natural language generation functions. Any feedback on the requirements is welcome. Thank you for your attention. [music]

Definitely the most musical workshop that you are ever going to attend. Thanks so much to all the presenters. I know there were a lot of lightning talks, but you'll get a chance to talk to each of these presenters during the poster session. Before that, we have an eight-minute break; during this time, get your coffees, and then we'll come back for the poster session. If you can stop the recording, Emily, that would be great.