Hello, everybody. Welcome to the February 2019 Wikimedia Research Showcase. I'm Jonathan Morgan, and I'm going to be your host today. We have two exciting speakers. Our first speaker will be Licia He from the University of Michigan, presenting work she has done with collaborators at Northwestern University and the University of Michigan. The title of her talk is The Tower of Babel.jpg: Diversity of Visual Encyclopedic Knowledge across Wikipedia Language Editions. Following her talk, we'll have Ramtin Yazdanian from École Polytechnique Fédérale de Lausanne with a presentation titled A Warm Welcome, Not a Cold Start: Eliciting New Editors' Interests via Questionnaires. Each speaker will have about 20-25 minutes to give a talk, and then we'll have 5-10 minutes for questions. Baha Mansurov will be fielding questions posed on our IRC channel, wikimedia-research, and on the YouTube stream. With that, I think we're ready to go. Licia, I'll hand it off to you.

Okay, great. Thanks. I'm very excited for this chance to present our study on image diversity across Wikipedia language editions. We all know that Wikipedia is a very large multilingual knowledge base with many language editions, and many people are interested in analyzing the relationship between those editions. When dealing with this relationship, people tend to make two common assumptions.

The first assumption is that all Wikipedia language editions are created equal. It holds that Wikipedia content is similar across language editions: if I have an English article about the concept of chocolate, then I should also find a similar article in every other language edition. However, previous studies have shown that this is not true; they suggest that over 73% of concepts exist in only one language edition. So there is no way that all language editions are equal to each other.

If we look at the sizes of the Wikipedias, we quickly realize that English Wikipedia is so much bigger than the other language editions. So it is tempting to assume that English is simply so big that it covers the other language editions. This is the assumption we call English as superset. However, this is also not true according to previous studies.

We think this body of research is very exciting, because knowing that concept diversity exists gives us better ways to think about and use Wikipedia information. For example, if I mine all my information from English Wikipedia and believe the results I obtain are also representative of the German language edition, then I'm probably wrong. So we really like this line of research. However, we did find one thing missing: we don't know much about images.

In fact, it is quite hard to predict whether there will be more image diversity than text diversity, or the other way around, because of several well-motivated competing factors. On the one hand, unlike text content, images are likely to be stored in a centralized repository, Wikimedia Commons. This large multilingual media repository makes it possible to search for and reuse images uploaded by contributors of other language editions, which is likely to reduce image diversity across language editions. Also, unlike text, images might not always require active translation.
Therefore, they can be transferred into multiple languages easily, again reducing image diversity. On the other hand, previous studies have shown that different cultural preferences for images exist, which might boost image diversity. So in order to investigate image diversity on Wikipedia, we had to gather a dataset and actually look at the results.

We started with a large dataset covering 25 language editions, containing about 24 million qualified articles as well as 10 million qualified images used by those editions. With this dataset, we describe and quantify image diversity at different levels.

The first level is what we call language-edition-level diversity. At this level, we care about what is or is not in an edition. In the text version, this means gathering all concepts and counting how many editions each concept appears in. As mentioned, over 73% of concepts appear in only one language edition, and if we look at the full chart, we see that it's quite hard to find a global concept, one that appears in all 25 language editions. That's what has been done for text.

The image version is very similar: we gather all the images and count how many editions each image appears in. Looking at the first bar, we find that over 67% of images appear in only one language edition, and about 14% appear in two. Looking at the full chart, it is actually really hard to find an image that appears in all 25 language editions: out of the 10 million images in our dataset, only 142 appear in all 25. This means there exists extensive image diversity across language editions at this level. Comparing image and text diversity, the trends are very similar, except for small differences at the beginning and the end.

Besides this big picture, we can also compare image diversity within a pair of language editions, for example the percentage of overlap between English Wikipedia and Dutch Wikipedia. One motivation for such pairwise comparisons is to test the assumption that English is a superset of the other editions, that is, whether the English edition actually covers any other language edition. With 25 languages in our dataset, we get 600 ordered language pairs. The highest coverage we find is between Korean and English: 64% of the images used in Korean Wikipedia are also used in English Wikipedia. Going down the table, the lowest coverage in the entire dataset is for English paired with a Slavic edition: only 2% of the images used in the English edition are also used there. With this pairwise comparison, it is safe to say that English as superset is a false assumption at this level, because even in the best case English is missing at least 36% of another edition's images.

So far, we have presented results at the language edition level and demonstrated substantial diversity in terms of what is or is not in an edition. We now take one step further and analyze diversity at the within-concept level. At this level, we care about which images are used to describe the same concept.
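As a brief aside before moving to the next level: the edition-level tally described above is conceptually just a count, per image, of how many editions use it. Here is a minimal Python sketch, assuming we already have a mapping from edition codes to the set of images each edition uses; the data structure and function names are illustrative, not the authors' actual pipeline:

```python
from collections import Counter

def edition_counts(usage):
    """usage: {edition_code: set of image filenames used in that edition}.
    Returns a Counter mapping each image to the number of editions using it."""
    counts = Counter()
    for images in usage.values():
        counts.update(images)  # each image counted once per edition
    return counts

def diversity_histogram(usage):
    """How many images appear in exactly k editions, for each k."""
    return Counter(edition_counts(usage).values())
```

On the paper's 25-edition dataset, a histogram like this is what produces the figures above: roughly 67% of images at k = 1, about 14% at k = 2, and only 142 images at k = 25.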
To give you a brief example of this within-concept level: we gather all the articles describing the same concept, for example chocolate, and check whether they use the same images or different ones.

To measure within-concept diversity, we took two steps. The first step is to align articles into article pairs; this means we need to know that the chocolate article in Chinese and the chocolate article in English describe the same concept. After the aligning process, we have 7.6 million article pairs. Then we calculate something called the ratio of language 1 in language 2. It has a rather long name, but it's a very straightforward measure of the image overlap between two articles (a minimal code sketch of it appears at the end of this passage). Let me give you a hypothetical example. Imagine we have two articles describing chocolate, one in Indonesian and one in English. The Indonesian article uses two chocolate images, the English article uses three, and they have one image in common. In this case, the ratio of Indonesian in English is 1 over 2, or 0.5. We can also flip the order of the languages and take English as language 1 and Indonesian as language 2; then the ratio of English in Indonesian is 1 over 3. Keep in mind that a lower ratio means less overlap and therefore more diversity, and a higher ratio means less diversity.

Again, we have 600 language pairs. Looking at the results, the highest ratio we found is between Hungarian and Romanian: on average, a Romanian article covers 76% of the images of the Hungarian article on the same concept. The lowest ratio we found is between German and Indonesian: on average, an Indonesian article covers only 22% of the images of the German article on the same concept. This indicates substantial diversity in how concepts are described visually. Besides the lowest and the highest, we also want to highlight within-concept diversity for pairs that involve English. The highest English-related ratio we found is 69%, and the lowest is 44%. So at this level, English as superset is still a false assumption.

So far, we have seen a lot of image diversity at this level, but how does it compare to text diversity? In this chart, we plot the difference between image diversity and text diversity: the x-axis runs over our 600 language pairs, and the y-axis shows the ratio difference between image diversity and text diversity. If the difference is close to the zero line, image diversity and text diversity are about the same; if it's above zero, there is more text diversity than image diversity; if it's below zero, it's the other way around. Looking at the actual chart, most of the data points are below zero. In fact, only 22 language pairs out of the 600 have more text diversity than image diversity. This suggests that visual diversity is, on average, greater than the observed textual diversity at this level.

Besides those two levels of diversity, we also extended our research beyond the 25 language editions by asking: is there a truly global image? As a reminder, we previously defined a global image as one that appears in all 25 of our selected language editions.
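And here is the promised sketch of the within-concept ratio. A caveat: the precise definition lives in the paper; the version below simply reproduces the worked chocolate example, assuming the denominator is the image count of the language-1 article, which is what the numbers in the example imply:

```python
def ratio_l1_in_l2(images_l1, images_l2):
    """Image overlap between two articles on the same concept, as a fraction
    of the language-1 article's images. Lower ratio = more diversity."""
    if not images_l1:
        return None  # undefined when the language-1 article has no images
    return len(images_l1 & images_l2) / len(images_l1)

# The hypothetical chocolate example from the talk (filenames invented):
indonesian = {"choc_a.jpg", "choc_b.jpg"}             # 2 images
english = {"choc_a.jpg", "choc_c.jpg", "choc_d.jpg"}  # 3 images, 1 shared
print(ratio_l1_in_l2(indonesian, english))  # 0.5    (ratio of Indonesian in English)
print(ratio_l1_in_l2(english, indonesian))  # ~0.33  (ratio of English in Indonesian)
```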
We have also said that it's very hard to find a global image even in those 25 language editions: only 142 of the 10 million images we gathered appear in all 25. So, is there any image that appears in all language editions? The answer is no. We searched through 287 language editions and found no image that is used across all of them. However, we did find that this image of a type of bacteria is the most popular image across Wikipedia language editions: it is used in 173 of them.

So far, we have seen the big picture of image diversity, involving millions of data points. We think these results suggest one major conclusion, the main insight our study aims to convey: we should be cautious of visual bias. If we only crawl information from certain language editions and assume our results are applicable across language editions, we are bringing bias into our systems and models. As more and more image resources are being mined and used today, we hope to make Wikipedia content creators and consumers aware of the substantial image diversity and help them avoid bringing visual bias into their studies and usages.

We think this high-level message is very relevant to data mining, and at the same time we hope to give Wikipedia readers a way to examine image diversity case by case. For example, a reader might be curious about how different languages describe the concept of happiness visually. To support that, we built an online tool called Wiki Image Dive that displays image diversity case by case. In this tool, you can search for a concept of interest, such as happiness. Once you start the query, our system finds all the images used to describe the concept happiness and displays them in a table.

Besides the table, we also implemented a chord diagram visualizing pairwise image usage overlap. Each band represents a language edition; in this case, DE is German, JA is Japanese, and ZH is Chinese. If you click on a band, you can see all the images used by that particular language edition. For example, if we click on the German band, we find this happy baby image used in the article, and we also see a happy gorilla. Chords connecting two language editions mean that images are shared between them. The particular chord I'm pointing at connects Japanese and Chinese, and on it we find a number of shared sculpture images, for example this particular picture. On the chord diagram, you will also see chords that do not connect to another edition; those represent images that are used only in that particular language edition. For example, on the Russian happiness page we find this Russian Orthodox priest, and this image is only used in the Russian article.

With this tool, we can make small case-by-case examinations and also compare them. For example, on the left I'm plotting a chord diagram for the concept Wikipedia. You don't have to read it in detail, but it gives you an idea of how many images are shared across different language editions for this particular concept. We can compare this chord diagram with another one: on the right, I'm plotting the chord diagram for the concept science.
We instantly get the idea that while editors share a common visual description of Wikipedia, editors from different language editions clearly have different preferences and understandings of the concept of science. Similar cases happen for pizza and hamburgers as well, and we encourage you to go ahead and try our tool.

So, while we have identified extensive image diversity across language editions, a rigorous understanding of what drives this diversity remains an open question. Using our tool, we have identified several potential causes. We believe the biggest cause behind this diversity is probably cultural contextualization: different cultures may have different preferences and understandings of visual content. We believe that the Russian Orthodox priest exists in the Russian article for a reason. Analyzing the potential causes could be a very exciting future direction. Thanks very much for joining us. The Wiki Image Dive tool is available on the website listed here, and I'm happy to answer any questions you might have.

Thank you, Licia. That was fascinating. First, I want to see whether there are any questions from people here in the live hangout. No questions for Licia here? Then we'll take questions from IRC or YouTube.

Yes, we have three questions so far. Awesome. I'm going to ask the first one: how do you measure text diversity? Could you repeat that again? How do you measure text diversity, and can you link to the image explorer tool? Sure. The text diversity part was done by previous research, in particular a group of works by our co-author, Brent Hecht, and we directly compared our results with the text diversity extracted by that research. And the link for the online tool is shown on the screen: whatsincommons.info/icwsm.

Thank you. The second question from YouTube is also from Peter Meyer: does the distribution of counts of appearances of a concept or image across language editions look like one of the standard named probability distributions? We plotted the distribution, and it looks like a long-tailed distribution. We have a more detailed description of it in the discussion section of our paper.

And another question from IRC. Miriam is asking: do you see any way in which computer vision or image analysis can improve with this work? Can improve with this work? In this particular work, we're trying to offer a suggestion to people who use Wikipedia information as a resource for training, for example, computer vision projects: they shouldn't just mine information from one repository and thereby bring this visual bias into their systems. But if a potential project aims, for example, to reduce the number of image variants on Wikipedia, then this work could give some direction. Thank you very much. I think those are the questions we had so far. Okay, thanks. Awesome. Thank you, Baha.

Well, unless there are any other questions from the room, I have a few questions, so I think I'm just going to go ahead. My first question: I've been thinking a lot about systemic bias within Wikipedia projects lately, and you mentioned that it's important to consider the potential for visual bias if we treat the image corpus of one language's Wikipedia articles as a superset.
Can you talk a little more about some of the potential negative consequences of perpetuating this visual bias? In other words, why does visual bias matter?

Okay, yeah. I think this is a very relevant question. Imagine a researcher trying to train a model on, for example, depictions of chocolate, so that it automatically recognizes all the chocolates in the world. Unfortunately, this researcher only mines information from, let's say, Japanese Wikipedia. Then the trained model is definitely not going to be applicable or accurate in any other language context. This visual bias brings down the accuracy and the performance, and it might have more serious consequences when the target is not chocolate.

Right, that makes a lot of sense. My next question is something I don't think you addressed directly in the presentation, but I'm curious whether you have data around it. I'm trying to figure out how to phrase this. For projects that don't share a lot of images with other projects, to what extent is that a consequence of the degree to which the project uses Wikimedia Commons at all? Images can be uploaded directly to the various language Wikipedias, or they can be pulled from Commons. I'm wondering if Wikipedias that tend to use Commons less overall also have more visual distinctiveness.

Yeah. We didn't specifically analyze this problem, but we did find that there are particular language editions that tend to use more images uploaded locally to their own wiki. In my experience, Turkish Wikipedia is one example. But we didn't actually measure what percentage of images come from Wikimedia Commons versus a particular language edition. We did analyze some hosting structures, but because this wasn't the focus of the paper, we didn't report those statistics. I do think it would be an interesting direction to see whether that affects diversity.

Cool, thank you. And with one final check of IRC, do we have anything new, Baha? Yeah, we have one question from YouTube: it would be great to have a better elaboration of what image bias is and when and why it matters.

Okay. In our description, image bias would be imagining that a thing, some kind of object or concept, is described in a certain way without considering other potential possibilities. For example, when describing pizza or hamburgers, a particular language edition might have a specific way of thinking about hamburgers, while English might have a different one. Assuming that the entire description of hamburgers is the English one, treating the English description as a gold standard, would be a way of ignoring alternative descriptions. That's how I interpret the bias.

Okay, thank you. I think that's it from me. Great. We have a couple more minutes, so I'm going to ask one more question. It's a broader question: what's next for this line of research?

Okay. There are multiple directions we have considered. One direction is analyzing the image content on Wikimedia Commons, to understand what is uploaded to Commons and how it is used. So that's one possible direction.
We also discussed the direction of trying to understand the factors that drive the diversity. And personally, for my project, I'm trying to understand more of the visual side, so I'm trying to get information about different color usage on Wikipedia. So there are multiple things we're working on.

Well, I hope you come back and join us again when you've completed the next step. Thank you very much, Licia. Thank you.

So next up, we have Ramtin Yazdanian. Ramtin, are you there? Yes. Hello. Excellent. Ramtin is going to present a paper titled A Warm Welcome, Not a Cold Start: Eliciting New Editors' Interests via Questionnaires. And with that, I'm going to hand it over to you, Ramtin.

Okay. Thanks for the introduction, Jonathan. Let me share the screen. Okay. So hello, everyone, and I'd like to give you a warm welcome to this presentation, which is basically our research on eliciting new editors' interests via questionnaires. The first question you might ask is: why new editors? I'm going to give you an introduction to why this matters at all and what our contributions are. Then I will go into more detail on the methodology and how we evaluated it, and, most importantly, an online test that we did with real Wikipedia users.

First of all, why do we care about new editors? About 10,000 new editors join all the wikis on a daily basis. Very few of them stay. At the same time, there is a priority within Wikimedia to diversify the editor base, which is known to be not as diverse as it could be. An obvious solution seems to be retaining some of these newcomers, and one idea could be to give them personalized recommendations. Why? Because when I join Wikipedia as a newcomer, let's say the English Wikipedia, there are over 5 million articles, and I'm completely lost. I don't know what is out there, I don't know what I could contribute to, and I don't know whom I could work with. So as soon as a new user signs up, we want to give them some sort of personalized recommendation to try to keep them in the system.

But recommendations are easier said than done, because all recommender systems suffer from a problem called the cold-start problem. This problem stems from the fact that most recommendation methods rely on user history. Let me, for example, explain the most famous one, collaborative filtering. Say I, as a user, have been on the platform for a while and have edited a bunch of articles; other editors have also edited some articles. This recommendation method in some way finds editors who have editing interests similar to mine, editors with whom I share some sort of history, and it recommends to me other articles that they have edited. The problem with newcomers, as you can see, is that we don't have any history available for them, so the personalized recommendations will actually not be personalized at all.

One common solution in the literature for this sort of problem is to automatically generate questionnaires based on the data we have. Based on a user's responses to the questions in the questionnaire, we build a profile which we can use for recommendations, in our case either article recommendations or collaborator recommendations. There's quite a bit of literature, but the short version is that the user indicates whether they like or dislike a few items.
Then the initial recommendations are made based on those likes and dislikes. The main question the literature tries to tackle is how we should choose this small set of items, because we can't keep asking the user question after question.

Now, one of the challenges for Wikipedia is that most of the existing literature relies on explicit ratings: on Amazon, you can rate this item one star and that item five stars. We don't have any such thing for Wikipedia articles. All we have is that, for example, this editor has edited this particular article 200 times, so they probably like it. This means that existing questionnaire generation methods are largely inapplicable to Wikipedia. Another problem is that not every article on Wikipedia needs contributions, and not every person is eligible for every kind of contribution. So there are multiple challenges, and this is where we come in.

Our contributions, no pun intended, are these. First, we propose a language-independent question generation method to create questionnaires. This method uses topic vectors as its input. I will go into the details of what those are later, but for now, consider them to be numerical arrays that assign a relevance value to each article, values indicating how relevant each article is to that topic, where each topic represents a semantically cohesive concept. These topics also need to capture dichotomies; I will get to what that means later. Now, as I said, the input to this method is a topic vector, or a list of topic vectors. To make this possible, we also propose three topic extraction methods: one uses the content of articles, another uses the editing history of articles, and the third attempts to combine both. Finally, we provide article recommendations based on the user's elicited profile, and we have evaluated our article recommendations with actual newcomers to Wikipedia.

We have a couple of bonus contributions as well. First, the questionnaires we generate are not limited to article recommendations; they can be used for collaborator recommendations too, and I'll get back to this at the end of the presentation. Second, there's no need to import any data from any other platform. Everything is self-contained within this questionnaire, so privacy concerns are addressed.

Now, let's get down to the business part of our contributions. In general, we have a basic pipeline that consists of two parts. The pre-processing part is what we do before the questionnaire even goes online: first we extract the topic vectors, and then we generate the questions. Once the questionnaire is up, users can take it; we take their responses together with the topic vectors and combine them to generate the recommendations for the user.

Before we go into the details of what topic vectors and dichotomies are, I'm going to give you an example of what the system looks like in action. This is more of a pre-alpha version of the system, the one we used for the online test; the final product will not look like this. Basically, you can see two lists of articles, list A and list B. List A is mostly about singers, and list B seems to be mostly scientific, particularly economics.
The user is asked to compare these two lists based on which one contains more articles that they would be interested in editing. This is the rest of the page. They can indicate their preference between these two lists of articles on a seven-level scale, from greatly prefer B to greatly prefer A, and they can also say that they prefer neither. So the idea is that they get a set of questions, where each question is a comparison between a pair of article lists.

Now, let's get back to the pipeline. The first step was extracting the topic vectors. What are these topic vectors? They're numerical arrays that have one entry per Wikipedia article. For this work we've only used the English Wikipedia, but our basic methodology is applicable to any Wikipedia. Each entry indicates the relevance of that article to the topic at hand, and it can be positive or negative. As for dichotomies: these topic vectors will have a set of large positive entries and a set of large negative entries, and the dichotomy aspect is that these two parts, the articles with the largest positive entries and those with the largest negative entries, should form two distinct clusters, each pertaining to a semantically cohesive concept. This allows us to display an approximation of the topic vector, an abstract mathematical construct, in the form you've seen: two lists of articles.

Once we have a topic vector, and we haven't yet discussed how we get those topic vectors, but bear with me, we create the two lists, A and B. We take, say, the 20 articles with the largest positive entries and call that list A, and the 20 with the largest negative entries and call that list B. The question then asks the user to compare these two lists and indicate which one they would be more interested in editing. By creating questions this way for several topic vectors, we get a questionnaire.

Once we have the user's responses, we calculate a linear combination of the topic vectors, where the weight for each topic vector is a numerical representation of the user's response to that question. When you sum these up, you get a single vector for the user, which we call the interest vector. In this vector, the value of the entry for each article is the relevance of that article to this particular user. So we've gone from relevance of article to topic, to relevance of article to user. This slide is a graphical overview of what happens: in step one, at the bottom left, the user answers three questions (in reality it's 20 questions); in step two, these answers are converted to their numerical representations; and in step three, we calculate a weighted sum, which forms the interest vector for the user. Once we have that, we know how relevant each article is to the user, which allows us to simply select the top k articles and show them as our recommendations. (A minimal code sketch of this computation follows below.)

Now, let's get to the part where we actually obtain the topic vectors. As you could see so far, our method is highly dependent on the quality of the topic vectors, because everything is built on top of them. We have three methods, which are in a sense competing methods.
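Here is the promised sketch of the profile computation just described: a weighted sum of topic vectors, with weights derived from the questionnaire responses. The numerical values in the response mapping are hypothetical; the paper's actual encoding may differ:

```python
import numpy as np

# Hypothetical response-to-weight mapping for the seven-level scale:
# positive weights pull toward list A (large positive topic entries),
# negative weights toward list B (large negative topic entries).
RESPONSE_WEIGHTS = {
    "greatly prefer A": 1.0, "moderately prefer A": 0.6, "slightly prefer A": 0.3,
    "neither": 0.0,
    "slightly prefer B": -0.3, "moderately prefer B": -0.6, "greatly prefer B": -1.0,
}

def interest_vector(topic_vectors, responses):
    """topic_vectors: (num_questions, num_articles) array, one topic per question.
    responses: one response string per question.
    Returns a single vector: the relevance of each article to this user."""
    weights = np.array([RESPONSE_WEIGHTS[r] for r in responses])
    return weights @ topic_vectors  # weighted sum of the topic vectors

def recommend(interest, titles, k=5):
    """The top-k articles by relevance in the user's interest vector."""
    return [titles[i] for i in np.argsort(interest)[::-1][:k]]
```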
To come back to the three topic extraction methods: in the first, we use only the content of articles, the textual content. In the second, we use the editing history: for each user, we know how many times they've edited each article, and that constitutes our data, on which we perform matrix factorization. I'm not going to go into the details of that; I'll discuss more what each of these means.

The good thing about using article content is that the textual content of articles captures clearer semantic relationships. At the same time, there's the issue that a single word might have multiple meanings in different contexts, and that can cause articles that are fundamentally different to seem more similar than they actually are. The editing history has the upside that it's generated by existing editors, who are similar to our target audience. But at the same time, not every edit is about adding a new section or a new paragraph; a lot of edits are things that don't take a domain expert, such as correcting typos and grammatical errors, which don't necessarily have semantic cohesion. If a user has focused particularly on those, their edits will not form a semantically specific profile. The good thing about combining the two is that we still get the rich content data and the semantic relationships, but since the editing data is of a fundamentally different nature, it lets us disambiguate the polysemy to a certain degree. What we get is more human-readable questions, because one of the most important things here is that the user needs to be able to understand what the question is asking them in the first place. Again, if you have questions about the technical details, we can discuss them later; for now, I'm going to skip them.

Now, of course, there are several important questions. First of all, are the questions actually human-readable? As we saw in the example, they do seem quite human-readable; we've manually checked, and we also have numerical metrics for that. And how well does the method perform, especially with real people, with actual newcomers, the target audience, taking the real questionnaire and getting recommendations?

So we have three aims. First, are the questions human-readable enough, and specifically, is our joint method, combining content and editing history, better than the other two? Second, what we have is quite a complex pipeline; do we actually outperform simple baselines, baselines like recommending the most viewed articles of the past month or the most edited articles of the past month? And third, as I said early in the presentation, we don't have user histories, which means traditional recommendation methods like collaborative filtering are out of the question. But suppose we did have some history on the users: how close would our method come to the performance of collaborative filtering?

So we have two baselines, recommending the most viewed articles, which I'll call ViewPop, and recommending the most edited articles, which I'll call EditPop, plus a ceiling, collaborative filtering, which we'll call CFBase. Now, one part of the offline evaluation that I want to discuss here is the cohesion of questions.
We have a metric for question cohesion, defined in terms of the similarity of those top-20 and bottom-20 articles, list A and list B: the similarity of all pairs of articles within each list (sketched below, after the results). What we can see is that our joint method definitely outperforms the content-only and editing-history-only approaches. I'm not going to go into the offline recommendation performance, because we don't have enough time and the online test is much more important, so let's get to that.

First of all, the target audience is newcomers, so we only took users who had joined the English Wikipedia between September 2018 and December 2018. We divided them into two groups: complete newcomers, people with next to no history, and relative newcomers, people who have already made some edits. These two groups get different recommendations, because the complete newcomers have no history, so we can't give them any collaborative-filtering-based recommendations. The complete newcomers only get recommendations based on their questionnaire responses, plus ViewPop and EditPop; the relative newcomers also get CF-based recommendations.

The idea is that they would rate the recommendations, because if we want to retain them, the first step is that they should actually like the recommendations they receive. If they see the list and are not satisfied with the results, it's not very likely that we'd be able to keep them. Each user received six pairs of article lists. Each list had five articles, and each list was generated by a single method; for example, one list is generated only by ViewPop. Each pair involved comparing one list of questionnaire-based recommendations with one of the baselines, the popularity-based ones, or with the ceiling, the CF-based one. And we asked them three questions about each pair: which list would you prefer for reading? Which list would you prefer for editing? And which list has more articles that you are not interested in at all? The first two questions separate reading from editing, because those are not necessarily separate things in people's minds, and the last one is to make sure we are not giving them completely unsatisfactory recommendations.

We executed this by sending out participation tokens by email, and we had pre-generated the CF-based recommendations to make things go faster, because participants might just leave otherwise. We also kept the questionnaire to 20 questions, to keep users from getting bored and simply leaving.

Now I'm going to present some of the results. In the three tables you see here, we have ternarized the results. What does that mean? A win is when, in one pair, our method is judged to be better than the other; a draw is when they're judged to be equal; and a loss is when the other method beats our recommendations. We got a total of 279 participants, 180 of whom were relative newcomers, and we present the results separately. The reason these numbers sum to more than you'd expect is that, if you remember, each user got six pairs, while we only had two baselines and one ceiling. What you can see is that, pretty consistently, our method beats the baselines and is beaten by the collaborative filtering ceiling.
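To make the cohesion metric mentioned above concrete, here is one plausible reading: a minimal sketch assuming each article has a vector representation (say, from the topic extraction step) and that similarity means cosine similarity. The paper's exact similarity measure may differ:

```python
import numpy as np
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two article vectors."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def list_cohesion(article_vectors):
    """Mean pairwise similarity among the articles shown in one list
    (e.g., the 20 articles of list A or list B of a question)."""
    sims = [cosine(u, v) for u, v in combinations(article_vectors, 2)]
    return sum(sims) / len(sims)

# A question's cohesion could then be the average over its two lists:
# question_cohesion = (list_cohesion(list_a_vecs) + list_cohesion(list_b_vecs)) / 2
```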
Coming back to the results: an interesting thing is that if you look at the uninterestedness table, you'll see that our margin of defeat is much smaller there. This means that even though collaborative-filtering-based recommendations might be generally more interesting in some sense, when it comes to not recommending outright bad articles, we are not that far behind collaborative filtering, which is a very high bar to beat.

One thing you can still see, though, is that we do lose quite a few times against the baselines, so we tried to investigate that as well. We define strongly dissatisfied participants as people who preferred a baseline over recommendations made by our method in over half of the comparisons, and our question is: why are they dissatisfied? Based on how the system actually works, one possible reason, which is our hypothesis, is that the system gets confused if you give it too many preferential responses. If you remember, we had seven levels in the responses: two of great preference, two of moderate preference, two of slight preference, and one option for no preference. If you give too many preferential responses, the system seems to get things mixed up. The data seems to support this to a certain degree, but not too strongly; we don't have enough data to say it with certainty. But what we do see is that if we calculate the proportions of non-preferential and preferential responses, the first column on the left versus the three other classes on the right, and compare these between the satisfied and the dissatisfied users, the satisfied users seem to have significantly more non-preferential answers. This could also be attributed to the dissatisfied users being less sure, but given that we don't have enough data to support that, we would like to stick with the hypothesis that this is mainly coming from our own system. So that's one area for future work.

And that's basically it for our results, so let's sum things up. The system works and it beats the baselines, and in some cases, in particular in the case of not recommending unsatisfactory articles, we have performance that's quite close to the ceiling. Too many preferential responses seem to reduce user satisfaction, or actually the other way around: dissatisfied users seem to have given more preferential responses. One of the things I brought up at the beginning was that not every article needs contributions. But given how our system operates, all we need in order to recommend articles that do need contributions is a list of stubs, or generally lists of articles needing contributions, which means we can recommend articles that are both interesting to the user and in need of contributions. As I mentioned earlier, recommending articles is not all: we can also match newcomers both with other newcomers and with veterans, which is something we did try and did not quite succeed at, but it remains an open direction for the future. Another direction for the future is this: what we did was a first step, checking whether users were actually satisfied with the recommendations they were getting from our questionnaire. Next up is measuring retention: do we actually get higher retention if we give newcomers personalized recommendations? That is an open question. And with that, I'd like to thank you all for listening, and let's have some questions.

All right. Thank you, Ramtin.
First up, I want to open it up to other folks who are here on the live call. Any questions for Ramtin? Okay. Next, go ahead, Marshall. It took me a second to get the mute going.

Hey, thanks for the presentation. I'm Marshall Miller, the product manager for the Growth team at the Foundation. This work is really relevant to ours, because what our team is focused on is increasing new editor retention. One of the ways we approach that is with the hypothesis that new editors are more likely to start editing and keep editing if they're able to accomplish the goal they have in mind when they first arrive at the wiki. Some of the research the Foundation has done in the past says that a lot of newcomers arrive with something specific they're trying to do, and if they're able to do that, they stay. So our team thinks about how we can identify what users are trying to do and what they're interested in, so that their experience can be personalized toward what they need. That's why we're interested in being able to identify user interests and, in general, know more about what users are trying to do when they arrive. So I wanted to ask you, because we're thinking about these things from the practical perspective of how to implement them on the wikis: do you know how long it takes users to get through your questionnaire? Did you learn anything about their patience with it, whether they were interested in it, excited or engaged by it, or whether a lot of them would drop off? And related to that, do you have any ideas or hypotheses about how to increase a user's engagement with it, or make it shorter, or whatever else might make it more likely that a user completes it?

So first of all, this work was focused more on getting users' interests in terms of topical interests, so we hadn't focused on the aspect you just mentioned, which is quite an important one. I think the methodology for that would be quite different; we would probably be doing a survey of what people would like to do on Wikipedia, similar to the work that has been done on why people read Wikipedia. Regarding how interested people are in filling out the questionnaire: the test we did first didn't involve article recommendations; it involved pairing newcomers together. What we found was that people were not interested: we got a lot of people answering the questionnaire, but very few of them actually interacted with the people we matched them with. So their patience doesn't seem to be at a very high level. As for the current questionnaire of 20 questions, I believe you can go through it in less than 20 minutes, because if you don't feel like reading all the article names, you can just take a look at the word cloud, go with that, and still get good results. It is still possible to reduce the number of questions. However, we didn't collect additional textual responses on how much people liked the questionnaire itself; our focus was on how much they liked the recommendations. So I don't have a complete answer to that. Okay, thank you.

Excellent. Any other questions from the room? If not, let's move on to IRC. We have three questions from YouTube so far. The first question is by Hoca Medvin Turkey.
He's asking: how can this method be used for recommending experienced users who can support newcomers, for example by giving advice, and so on?

So, basically, an experienced user could simply also take the questionnaire. And if many experienced editors take the questionnaire, we could have a large set of veterans with whom we could pair newcomers. Right now, all we have is the questionnaire, so everyone has to go through it, even if they're experienced. But we could potentially also match them using the interest vectors, and the interest vector of an experienced editor could be calculated through other means as well. So, to answer the question: it's quite possible, and the experienced users might not necessarily have to take the questionnaire for that.

Okay, thank you. The second question is also from the same user: can this method be expanded to recommend writing styles or reference sources to newcomers, based on the experience of similar users? I think it would need quite a few changes for that to happen. Not in its current form, no.

Okay, thank you. And the last question is from someone on YouTube too: I believe it is possible to know how many watchers a Wikipedia article has. Do you think it might be good if the article has few watchers, so that few people will revert the edits of the newbie? I'm not sure whether this was already addressed in some of your comments.

So the issue of reverts, as I'm sure you all well know, is a completely separate problem, and it has been investigated before. We could technically incorporate that too, although I'm not entirely sure of its utility, because two of the challenges I mentioned were that, first, not every article needs contributions, and second, not every person is eligible to make every kind of edit. We don't address the second one at all, so I'm not really sure we should fiddle with that by recommending articles that wouldn't be watched so much. That's all I have.

Leila, do you want to jump in and expand on that last point? Yeah, I just want to quickly build on what Ramtin already responded to the last question and say that there needs to be a lot of care in going from understanding what the user's interests are to starting to make actual recommendations to the user, saying: do these specific edits on these specific articles. At the moment we are not focusing on the latter part; we are saying, these are the sets of articles, or the sets of items, that we think the user is interested in. What the user should do, or should consider doing, will depend on their experience, their expertise in handling the technology, and how willing the community the user is contributing to is to accept mistakes. These are things that will need to be coordinated, I would say, with the specific language communities these technologies will be working with, and a proper product will need to be built on top of this. I think we are far away from that.

All right. Thank you, Leila. So we are officially just a little over time here, but both of the speakers said they were able to stick around for a couple more minutes. So for anyone who wants to stay a little longer, I'll open up the floor for questions for either Ramtin or Licia. I think we have one question on YouTube: would it be a good idea to recommend articles with many watchers, or with few?
When it comes to maintaining the high quality of articles, I think recommending high-watch-count articles would be a good idea, because, as I said, we don't address the domain expertise part of it at all, and we also don't give the user specific tasks, because that's outside the scope of this part of the project: something like, you should go there and fix typos, or you should go there and fix grammatical errors. So yes, I think high watch counts would be good; low watch counts, not so much.

Excellent. Well, I have a question for Ramtin, and although I was involved in this study, I have a very poor understanding of machine learning and also a poor memory, so I think this is a legitimate question. It builds on the question about incorporating other sources of signal, like watchlists, and also on Marshall's question about what it would look like to streamline the questionnaire to potentially get more people to provide input. The question is: how can you see a system like this getting better over time, providing better recommendations over time, leveraging, say in a production context, all the responses that people have given so far?

That's a very good question. One thing I can think of is that if the system has gathered a lot of newcomer profiles, and then over the course of, say, a year, we see what those users end up editing, that could give us an indication of what happens within the system, because we might have a different perception of what these questions mean compared to what they end up meaning in the real world. So I think that's one of the ways it could improve. Also, there are multiple things that could be tweaked. For example, we have tried to diversify the recommendations we give, so as not to give the user the same thing over and over again, like different characters of the same series five times in a list. That's a one-off thing; it's not something that improves with the data in the system, but it is one of the ways we can improve it. But definitely, over time, as we get more data on users who end up staying, we're going to be able to tweak the system for better performance.

Thank you. Let's take one more question. First, I want to open it up to the room, and then we'll see if there's anything on IRC. I have another question, but I know I already asked one, in case there's someone else. Well, in that case, let's give somebody else a chance. All right, Pine. Baha, do you want to relay that? Yes, Pine on IRC is asking: is this questionnaire system extendable to encouraging newbies to contact active WikiProjects that have high levels of interaction with the types of articles the user is interested in editing?

Yeah, of course. With the idea of experienced editors also participating in this questionnaire, as part of a sort of mentoring program, that would certainly be possible.

Excellent. Sorry, I had to move out of the room because there were people jackhammering. I think we will call it for the day. Thank you to both of our presenters, and also thank you to Baha on IRC, to Brendan Campbell-Credigan on AV, and to Jan Aladin on scheduling. And we will see you all, hopefully, next month at the next Research Showcase.