Good morning, afternoon, or evening, depending on where you are. My name is Miriam and I'm a senior research scientist on the research team at the Wikimedia Foundation. What I'm going to talk about in the next 25 to 30 minutes is the result of a roughly year-long effort by the research team to build what we call the knowledge gaps taxonomy. Before getting into the details of the taxonomy itself, since I speak on behalf of the whole research team, let me introduce ourselves. This is the research team at the Wikimedia Foundation. There are six of us: Leila, who's the head of research, and five research scientists. We come from different backgrounds, from physics to human-computer interaction, natural language processing, and computer vision. And we use large-scale data to support and understand communities of readers and editors of Wikipedia and its sister projects. To start our journey towards the knowledge gaps taxonomy, let me start with a principle which most of us in Wikimedia spaces are familiar with. Knowledge equity is one of the two goals of the Wikimedia movement's 2030 strategy. Knowledge equity encourages us to focus our efforts on including in Wikimedia projects the knowledge and communities that have been left out by structures of power and privilege. It pushes us towards removing those knowledge inequalities and breaking down the social, political, and technical barriers preventing people from accessing and contributing to free knowledge. Now, while we're all working towards reaching knowledge equity, what would also be nice to have is a way to measure how well we are doing, a way to track our progress towards this end goal. One way to do this is what we call operationalizing knowledge equity. This means isolating and identifying the individual components and gaps which prevent us from reaching knowledge equity, and finding a way to measure those gaps.
By doing so, we can track our progress towards the end goal of knowledge equity. To understand this concept better, let me give you an example from one of the most well-studied gaps in Wikimedia projects. The gender gap has been investigated by researchers and practitioners for years. Researchers have found, for example, that a very small percentage of biographies in Wikipedia is about women. And the Wikidata Human Gender Indicators initiative tracks the percentage of female biographies across different Wikipedia editions. Similarly, a recent effort from the research team and collaborators tried to quantify the gender gap in Wikipedia readership. Through a survey, they found that 75% of the page views we receive in Wikipedia are from men. And this gender gap is even wider when we consider the gender distribution of editors. So as you can see from these initiatives, by quantifying the gender gap, we can understand how well we are doing in terms of knowledge equity on the gender front. But how can we do this more systematically? How can we measure and understand all gaps at this level? Different initiatives have focused on, or tried to address, specific gaps. However, what we were missing when we started this project was a systematic understanding of the types of gaps that we have in Wikimedia. And that is why we built this taxonomy of knowledge gaps, which maps the different types of gaps that we can find in Wikimedia spaces. We define a knowledge gap as a disparity in terms of coverage of a specific group of readers, contributors, or content across Wikimedia projects. So for example, the gender gap fits this definition. The gender gap in content is a disparity in terms of coverage of content about different gender groups. The gender gap for readers also fits this definition. It reflects a disparity in terms of participation of readers from different gender groups.
So our task in building the taxonomy was essentially to find in Wikimedia spaces those gaps that fit this definition. How did we build this taxonomy? If you're familiar with the concept of a taxonomy, you know that a taxonomy is generally represented as a tree, where the top level consists of a few coarse-grained categories and the leaves represent finer-grained categories or particular properties of the coarse-grained ones. So to start our construction of the taxonomy of knowledge gaps, we first had to find the root of the taxonomy. To do this, we read different sources from the literature, and we were inspired by this diagram of engagement of the Wikimedia movement, which was built by the product teams in the context of the medium-term planning last year, in 2019. Consistently, across the different sources that we went through, three main actors were identified as the main pillars of the Wikimedia ecosystem: readers, contributors, and content. And so, given this consistency, we chose content, readers, and contributors as the root dimensions of our taxonomy. Next, we had to populate the taxonomy. In practice, for each of the root dimensions (readers, content, and contributors) we did a thorough literature review to find elements of the Wikimedia ecosystem showing evidence of knowledge inequity, and we identified those as knowledge gaps. Then we organized those gaps in a structured taxonomy that can be easily consumed and intuitively understood by technical and non-technical people. So the next step, after having identified the root dimensions, was to identify the gaps for each of them. In practice, we went through tons and tons of literature to identify those gaps. There were three main types of sources that we analyzed. The first one is academic literature.
We read hundreds of papers from communication science, social science, and computer science that talk about, try to understand, or quantify different types of knowledge gaps in Wikipedia and its sister projects. The second type of source that we analyzed is the set of strategic recommendations that the Wikimedia movement released earlier this year, together with the set of initiatives that the Wikimedia community has worked on throughout the years to bridge knowledge gaps. Both of these community sources give us an idea of which gaps are important for the community. The third type of source is what we call community surveys. These are surveys that have been carried out by the Wikimedia Foundation, affiliates, and user groups to understand the composition of their reference communities. These surveys are especially about the composition of the reader and contributor communities. And by looking at the dimensions they analyzed, we could also expose some specific knowledge gaps. So the final structure of the taxonomy is as follows. We have three root dimensions: contributors, readers, and content. For each dimension, we have identified a number of gaps through the literature review. Now, because the root dimensions are few and the gaps are many, we added a middle layer to the taxonomy, a layer of what we call facets. A facet is a group of gaps which refer to similar properties or objectives. The gaps in a facet share similar characteristics, and facets are very useful to summarize the taxonomy and make it easy to share. So in the next few slides, I'm going to give you an overview of the three main dimensions of our taxonomy based on the underlying facets. Let me start with the readers dimension. Gaps in the readers dimension reflect different levels of participation in readership across different reader groups. There are three main facets in the readers dimension. The first one is the sociodemographic facet.
Gaps in the sociodemographic facet, as the name says, reflect disparities in readers' representation based on social or demographic characteristics. It was pretty easy to compile the set of gaps in this specific facet because there has been a lot of work around characterizing readers in terms of their sociodemographics based on community surveys. There are a lot of academic papers that try to understand readers' sociodemographics, and there are a lot of initiatives across the movement that try to bridge sociodemographic gaps. And so at the end of this research through these three sources, we identified seven main sociodemographic gaps. The gender gap, which as we said reflects differences in readership depending on one's gender identity. The education gap, which reflects different volumes of readership depending on readers' education level. Similarly, we have the background gap, which reflects the background of readers, including religious belief; the age gap; the locale gap, which reflects the distribution of readers across different geographic areas; the language gap, which is about readers' fluency in a given language; and the income gap, which reflects different volumes of readership depending on one's income. The next facet in the readers dimension is accessibility. Accessibility gaps reflect all those technical barriers which prevent people from accessing free knowledge. The accessibility facet is inspired by the Improve User Experience recommendation of the 2030 movement strategy, which essentially encourages everyone in the movement to build more inclusive platforms, interfaces, and knowledge services, so that Wikipedia and its sister projects can be accessed by people facing different accessibility barriers, including limited internet connection or physical disabilities. And so, based on the recommendations and the ongoing initiatives, we identified four different accessibility gaps.
The physical disability gap, which reflects different levels of readership depending on one's physical ability. The device gap, which is related to the type of device that people use to connect to Wikimedia sites. The technical skills gap, which is there because the literature has found technical skills to be a blocker to accessing Wikipedia and its sister projects. And finally the internet connectivity gap, which reflects different levels of readership depending on one's internet speed. We've seen a lot of initiatives in the past few years to bring Wikipedia and its sister projects into areas of the world without an internet connection. The last facet of the readers dimension is the information need facet. Gaps in this facet reflect different volumes of readership depending on readers' information needs. The construction of this part of the taxonomy was inspired by a body of work around understanding why readers come to our sites. The latest work is called Why the World Reads Wikipedia and was published in 2018 by members of the research team and collaborators. This work has its own taxonomy of information needs, and we ported that taxonomy directly into the information need facet. As a result, there are three knowledge gaps within the information need facet. The first one is information depth, which reflects different levels of readership depending on one's need for in-depth information in Wikipedia and its sister projects. The second one is familiarity. People come to the sites with different levels of familiarity with a topic, and this gap reflects that. The third one is motivation, and the motivation gap reflects different volumes of readership depending on one's motivation to come to the site. Now, this concludes the readership gaps. Next we're going to talk about the contributors gaps. This part of the presentation is going to be fairly short because the contributors taxonomy is pretty similar to the readers taxonomy.
There is a sociodemographic facet, where gaps reflect different levels of contributorship depending on one's sociodemographic characteristics. There is an accessibility facet. And then, instead of the information need facet, we have what we call a motivation facet. Gaps in the motivation facet reflect different volumes of contributorship depending on the type of work editors come to do on the site. Depending on the role an editor has, they might do different types of work, and depending on the motivation editors have, they might do different types of work, and these two gaps reflect the different types of contributorship based on role and motivation. The final branch of the taxonomy is the content dimension. Gaps in this dimension reflect different volumes of content depending on different characteristics. There are three main facets in the content dimension. The first one is the policy facet. To explain the policy facet, I pasted a screenshot of the Wikipedia page about the core content policies. Most of us might be familiar with the Wikipedia core content policies. There are three: neutral point of view, verifiability, and no original research. For this presentation and the taxonomy of knowledge gaps, we focus on the first two. Neutral point of view requires that every article has a neutral perspective on its topic, and verifiability mandates that all material that is challenged or likely to be challenged has to be backed by a reliable source. And we know that not all articles have the same level of neutrality and not all articles fulfill the verifiability policy in the same way. And so the policy facet contains two gaps which reflect these two properties: verifiability and neutrality. The second facet of the content dimension is diversity. Gaps in this facet reflect different levels of content coverage depending on the topic represented.
This facet is inspired by the Identify Topics for Impact recommendation from the Wikimedia movement strategy, which essentially encourages us to develop and increase access to content that has historically been left out by structures of power and privilege. We can find a similar aim of supporting diverse content creation in the strategic direction of the Wikimedia Foundation from the medium-term plan, which was developed in 2019. And we also find organizations beyond the Wikimedia Foundation and its affiliates, like Whose Knowledge?, whose actual mission is to improve the diversity of content in Wikipedia and its sister projects. So based on the literature review and by looking at all these initiatives, we identified four types of diversity gaps. The gender gap is the content counterpart of the readers and contributors gender gaps, and it reflects different volumes of content depending on gender groups. Similarly, the cultural context gap reflects different volumes of content depending on the cultural background of a given article. The geography gap looks at the geographic distribution of articles or items in Wikimedia projects. And finally, the impactful topics gap looks at the differences in coverage of topics that are considered interesting or impactful for specific communities. The last facet of the content dimension is the accessibility facet. Like its readers and contributors counterparts, it is inspired by the Improve User Experience recommendation of the movement strategy. For this specific facet, we focus on the accessibility of three types of content that we find very often in Wikipedia and its sister projects. The three types of content are text, multimedia, and structured data. The accessibility gap for text is the readability gap, and it reflects the fact that not all articles have the same level of readability.
The multimedia gap refers to the fact that having visual and non-textual components in articles and items of Wikimedia projects is important in terms of accessibility for younger generations or for non-native speakers. And so the multimedia gap reflects the proportion of multimedia content across different Wikimedia projects. The structured data gap reflects the different types and amounts of structured data usage across Wikimedia projects. We give importance to structured data because it helps organize and manage large quantities of content in Wikimedia projects. Now, this concludes the overview of the taxonomy of knowledge gaps. We have seen a very coarse-grained overview of what's in this taxonomy. At the end of the presentation, I will provide you with a link where you can read a lot more about each individual gap and the sources that we used to mine this information. Now that this taxonomy has been built, where do we go from here? How do we use this taxonomy? We identified three different audiences that could be interested in using this taxonomy. The first one is researchers. We hope that researchers working on Wikimedia projects, or researchers who are using data from Wikipedia and its sister projects, will look at this taxonomy either to direct their future research, for example towards quantifying specific types of gaps, or to use it as a tool to understand the types of biases present in the data from Wikipedia and its sister projects that they use to train their algorithms. Organizations like the Wikimedia Foundation and affiliates might want to use this taxonomy to direct investments towards initiatives and events that address specific gaps, or gaps that have not been explored before by the organization itself. And finally, the Wikimedia community might want to use this taxonomy to prioritize or generate new initiatives about bridging knowledge gaps.
For years already the community has had many initiatives to bridge knowledge gaps, and this taxonomy can help support this trend even further. Finally, let me talk about the future. I said at the beginning that the final goal of our work would be to have a tool to measure how well we are doing in terms of reaching knowledge equity. Now, the taxonomy is essentially a first step towards building a composite index, which we would call the knowledge gap index, that can help us understand and quantify each individual gap, so that we can monitor over time how initiatives from different corners of the community are affecting our progress towards knowledge equity. To give you an example of what the knowledge gap index could look like, take the Gender Equality Index from the European Institute for Gender Equality. For each country, this index contains a series of indicators that monitor the progress towards gender equality. So the knowledge gap index would look something like this, based on knowledge gaps, to measure knowledge equity. And so with the knowledge gap index I'm getting towards the end of the presentation. Before closing, I want to thank everyone who made this presentation possible by making all the beautiful visuals that I used available on the web. I have two credit slides and a third slide containing the links where I took all the screenshots in the presentation from. With that, I thank you very much for listening until now, and I encourage you to give us feedback. This knowledge gaps taxonomy is just a first step towards knowledge equity, and it's a first draft itself. So we would really like to hear any sort of feedback that you might have on individual gaps, individual facets, the root dimensions, everything that you want to tell us about the taxonomy. You will find a way to give us feedback by following this link.
And with that, thank you very much for listening. I'm looking forward to hearing your thoughts on the knowledge gaps taxonomy.
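
The three-layer structure described in the talk (root dimensions, then facets, then individual gaps) can be sketched as a nested mapping. This is only a minimal illustration in Python, not an official data format from the research team: the names `TAXONOMY`, `count_gaps`, and `find_gap` are assumptions introduced here, and the two contributor facets whose gaps the talk does not enumerate are left empty rather than guessed.

```python
# Sketch of the taxonomy as described in the talk:
# dimension -> facet -> list of gaps.
# Names and layout are illustrative assumptions, not an official schema.
TAXONOMY = {
    "readers": {
        "sociodemographics": [
            "gender", "education", "background", "age",
            "locale", "language", "income",
        ],
        "accessibility": [
            "physical disability", "device",
            "technical skills", "internet connectivity",
        ],
        "information need": ["information depth", "familiarity", "motivation"],
    },
    "contributors": {
        # The talk says these two facets mirror the readers ones but does
        # not list their gaps, so they are intentionally left empty here.
        "sociodemographics": [],
        "accessibility": [],
        "motivation": ["role", "motivation"],
    },
    "content": {
        "policy": ["verifiability", "neutrality"],
        "diversity": ["gender", "cultural context", "geography", "impactful topics"],
        "accessibility": ["readability", "multimedia", "structured data"],
    },
}


def count_gaps(taxonomy):
    """Return the number of enumerated gaps under each root dimension."""
    return {
        dimension: sum(len(gaps) for gaps in facets.values())
        for dimension, facets in taxonomy.items()
    }


def find_gap(taxonomy, name):
    """Return every (dimension, facet) pair under which a gap name appears."""
    return [
        (dimension, facet)
        for dimension, facets in taxonomy.items()
        for facet, gaps in facets.items()
        if name in gaps
    ]


if __name__ == "__main__":
    print(count_gaps(TAXONOMY))
    print(find_gap(TAXONOMY, "motivation"))
```

For example, `find_gap(TAXONOMY, "motivation")` returns the readers information need facet and the contributors motivation facet, since the talk describes a motivation gap under both.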