Good morning, good afternoon, everybody. Welcome to the December edition of the Wikimedia Research Showcase. My name is Dario, and I'm joined here by a number of awesome people. Today we have a special edition of the showcase: we'll be using the entire hour to talk about some exciting new research that I'll tell you about in a second. Before I do, I also want to give you a short service announcement. We just announced the upcoming Wiki Workshop 2019. It's happening at The Web Conference in San Francisco on May 13 and 14, and the call for papers just opened as of yesterday, so please send us your contributions. The deadline for archival papers is January 31, 2019. We really look forward to seeing you there, and you can follow the Wiki Workshop Twitter handle if you want more information.

With that, I'm delighted to introduce two of the co-authors of this study: Florian Lemmerich from RWTH Aachen University, and Leila Zia from Wikimedia Research here at the Foundation. I'm very excited about this piece of research. It's an extension of previous work that the team has conducted with our collaborators, and I think it's fair to say it's the first in a series of comprehensive studies on Wikimedia readership, looking at a combination of quantitative data from our access logs and data from surveys of readers. So I'm very excited about the results we're going to hear about. As a reminder, we also have a live Q&A on IRC; Jonathan Morgan is going to be our host there. And without further ado, Florian, you can take it from here.

Well, thank you for having me here. Actually, Leila will start and present.

Oh, yeah, okay. Thanks so much. Just give me a second.
I'll share my screen. Good, okay. So for the next few minutes I'll be talking about the first part of this presentation: why the world reads Wikipedia. This is joint work, as Dario said, with Florian Lemmerich, Diego Sáez-Trumper, and Bob West, and with institutions that are our formal collaborators, EPFL and RWTH Aachen University.

For those of you who are familiar with the first part of this line of research, you have heard this number before: 6,000 page views per second. That is the number of page views by humans that we see on Wikipedia, and it is the direct use of the content that is being created by editors on Wikipedia projects across languages. Now, obviously, there are a lot of questions we have about the people behind these page views. Who are they? What are they trying to achieve when they come to Wikipedia? How do they learn? What languages do they read in? And many more. In this line of research we try to answer these and the many more questions that you see here.

The answers to these questions are important for us, and, by the way, I should clarify that when I say "us" I speak broadly of the Wikimedia and Wikipedia movement, because they have content implications. The content, as you know, is primarily contributed and handled by editors, and knowing who the audience for that content is can be important for editors to be aware of. It also has implications for content representation: not only what content we put on the projects, but how we put it, and in what form and structure.
Whether it's text or images, and so on and so forth: understanding this, and coming up with plans for how we represent content, can rely on understanding readers better. It also has an impact on tool and feature development; obviously, understanding better who the audience is can have an impact in these areas. It can also have an impact on policies, both the internal policies that Wikipedia and Wikimedia set for themselves and the policy negotiations that we do with entities outside of Wikimedia. And for those of you interested in the language of the strategic direction, this bottom line, a basic understanding of readers, will address some of the questions that we have around knowledge equity: who has access to content, and to what type of content; what content is available in their languages, in forms they can access and understand; and whose access is currently not possible, and why?

So these are the general topics and reasons that make us interested in this line of research, and today we are going to focus on some specific questions. What we're hoping is that by the end of this talk you can answer the following questions.

Are English Wikipedia reader behaviors and distributions of use cases representative of reader behavior in other Wikipedia languages? I think this is an important question for us to answer, because a lot of research in the past years, both on editorship and readership, has focused on English Wikipedia. And because English Wikipedia has a lot of readership, with requests from across many countries, one can assume that by understanding English Wikipedia you can understand all Wikipedia languages and the needs and motivations of the readers behind those languages. I'll do a spoiler alert here: what you will see by the end of this talk is that English Wikipedia is not representative, and we'll go into the details of what we have observed. And
given that English Wikipedia is not representative, the next question would be: are there commonalities between reader behaviors and distributions of use cases across Wikipedia languages, and are there differences? Well, the answer to both is yes, and you will see more specifics in the following slides.

There are also questions around whether people read long articles, or whether they often come to Wikipedia to do quick lookups. For those of us who reside mostly in North America and Western Europe, we see from our day-to-day experience that there is a lot of going to Wikipedia and doing quick lookups, and there's a natural question about whether our audience is more and more transitioning toward these short forms of learning, or whether there are actually people who are reading the long articles. Are there readers who read science, education, research, and medicine articles more than others? And if so, who are these people? What are their characteristics? What more can we know about them?

So we answer these questions using surveys. What we do is use QuickSurveys, which is an extension on Wikimedia projects, and here is how it works. The user comes to Wikipedia, let's say via an external search engine. They come to a Wikipedia page and start traversing from one page to another, and somewhere along their experience, if they have been sampled to participate in the survey, they will see a widget. I'll show you in the following slides exactly what they see, but basically they see the survey questions, and after that they have the option of continuing their experience, or trajectory, on Wikipedia. What is special about this?
This line of study that we do on why we read Wikipedia doesn't stop at the survey level. What we do is collect the responses from readers through surveys, and then we also match, or connect, those responses to the web request logs. These are logs that get collected and that we have access to for a short period of time, and they tell us the IP address of the request, the user agent, which country the request has come from, which Wikipedia language the request has gone to, and so on and so forth.

In part two of this research we also bring in country-level statistics. Because we know which country the request has come from, we can look at external databases, see what else we know about the country of the user, and try to answer questions about whether the country the user is in has an impact on the kinds of motivations they have when they come to Wikipedia. So I would say these two, the web request logs and the country-level statistics, are what make the survey responses really rich for us.

We specifically use the web request logs for two things that I want to mention before we go to the results. One is that we use them for debiasing the results. We won't go through the details of that here; I'm happy to discuss it in the discussion part of this presentation, or you can read more about it in the papers, but we basically need these more granular features in order to debias the survey responses that we collect as much as possible. We also need the web request logs because we are interested not only in knowing what the distribution of readers is
across different motivations, but also in characterizing their responses. Just to give you an example of what we mean by characterizing: while it's important to know what percentage of Spanish Wikipedia readers come to Spanish Wikipedia because they are motivated by work or school projects, it is not interesting for us to stop at that number. What we need, in order to have a deeper understanding of these users and their needs, is to understand the fundamental features that would allow us to characterize, say, students on Spanish Wikipedia. We will show some examples of what we mean by characterization in the results.

So let me tell you about the survey. I'm not going to go through this taxonomy of Wikipedia readers, but we basically use the taxonomy that we developed in the first part of this research, and this taxonomy allows us to ask three questions of the readers. We prompt the reader by asking "why are you reading this article today?", and this prompt helps the reader focus on the specific experience they are having on the article they are reading right now. From there we ask them three questions.

We ask about their information need. The reader answers: I'm reading this article to look up a specific fact or get a quick answer, to get an overview of the topic, or to get an in-depth understanding of the topic.

We ask about their prior knowledge: prior to visiting this article, whether they are familiar with the topic, or whether they're learning about it here on Wikipedia for the first time.

We also ask about their motivation, and here they can choose multiple responses. The choices they have are: I'm reading this article because the topic was referenced in a piece of media; I need to make a personal decision based on this topic; I'm bored and randomly exploring Wikipedia for fun.
The topic came up in a conversation; I have a work- or school-related assignment; I want to know more about a current event; or this topic is important to me and I want to learn more about it. The last one is an "other" field, where they can enter, in free form, a motivation that was not captured by the options offered to them.

Here are the mechanics of how the survey works, and here is how it looks on English Wikipedia. You have a desktop request or a mobile request, and depending on which one it is, the QuickSurveys widget that users see shows up on the top right-hand side of the page on desktop, or, on mobile, under the infobox. The prompt the user sees is "answer three questions and help us improve Wikipedia", and they have four options: ignore it, visit the survey, say "no thanks", or read the privacy statement of the survey.

We ran the survey for a period of one week, June 22 to 29, 2017, and 14 languages were included: Arabic, Bengali, Chinese, Dutch, English, German, Hebrew, Hindi, Hungarian, Japanese, Romanian, Russian, Spanish, and Ukrainian. The choice of these Wikipedia languages came from two places. Initially we put together a set of four or five languages chosen based on their script, but also on the language families they represent, because we wanted to maximize the diversity of the languages we include. The additional languages were added by request from the community; there was a discussion on the mailing list about which other communities wanted to be included, and that's how we ended up with these 14 languages. We surveyed readers on the mobile and desktop platforms, so not the apps. The sampling rate varied heavily across languages.
On English Wikipedia, out of every 40 requests we sampled one, while on Bengali Wikipedia every single reader saw the survey during that week. The choice of sampling rate was based on the traffic that comes to each language and the need to have enough responses to draw reliable conclusions from the data and the results that we see. The survey was shown on article pages, and not to users with Do Not Track enabled. We collected slightly over 215,000 responses across these 14 languages.

The data that we have is, first, the survey responses: the motivation, information need, and prior knowledge. We have information about the request from the web request logs, so we know the country, the continent, the weekday and hour at which the request was made, the host, and the referrer class. We have information about the article, because we know which article the user was on when they saw the survey, and we can compute certain features for that article, for example its in-degree, out-degree, PageRank, text length, page views, topics, and topic entropy. And lastly, we can create sessions for the users, and through that we can understand not only what happened when the user responded to the survey, but what happened within the session. So we look at the session length, session duration, average dwell time, average PageRank difference, average topic distance, referrer class, frequency, session position, number of sessions, and number of requests. We use all of these features to take the raw responses from the survey, debias them, and characterize the responses. And that's it; with this I will pass it to Florian to talk about the survey results.

Yeah, thank you, Leila, for this introduction. Now let me guide you through the actual survey results, so basically what we found out with this kind of data. Let me check: does the screen sharing work?
Yeah. Okay, great. So with respect to the results, let's start with a short outline. What we tried to figure out here were four different types of results. First of all, we wanted to do a robustness check with respect to the predecessor study, to find out how volatile reader motivation on Wikipedia is. Then I will guide you through the direct survey results, so basically what people answered. And then we go into a bit more detail in two different directions: first, what happens when we correlate the survey results with specific behavioral patterns that we find in the request logs, and then, as Leila already hinted, we also look at whether we can find interesting correlations of the survey results with country-level data. As a kind of disclaimer before we really start: I want to note that averaging over all the languages in which we did the survey is probably not enough in general, because it hides exactly the heterogeneity that is, I think, one of the big takeaways of this study.

Okay, but without further ado, let's start with the first part, the robustness of the results. What you can see in this plot, for each of the different survey answers, is the results that we got from the 2016 study on English Wikipedia, and what we got for the English Wikipedia part of the current study, which we did in 2017. The intuition for this plot is that we can check whether there is big volatility between these two years, whether there have been major shifts or anything. And I think, just visually, if you look at this plot, you would probably agree that, on average and in general, the survey answers we got in 2016 and 2017 are relatively robust.
We see some minor deviations, in particular here in the work/school motivation, where we have a decrease from 2016 to 2017, which might be related to seasonality: we did the 2016 study in March and the 2017 one in June. But I think overall we can agree that the study results are pretty robust over the span of one year, and that gives us more confidence, first of all in the research methodology that we use, and second, that what we measure in our study is not just a fluke or a statistical deviation that randomly appears, but that there is real substance in what we can find out here. Okay, so for robustness, I think we can overall give a big green check mark.

Let's go directly to the second part, the direct survey results, so what you immediately get out of the surveys. Let's start with the first question, which is about motivation. What you see here as a result plot are these grouped bar charts. Each group of bars reflects the results for one of the languages, and the results you see have already been debiased by the procedure using the request logs that Leila briefly mentioned. For example, if you look at the leftmost bar of each block, the blue bar, these are the respective percentages for the one survey answer "intrinsic learning". As you will also notice, for this specific question the bars do not add up to one, and this comes from the fact that for the motivation question readers could check multiple motivations as their reason to go to Wikipedia. Okay, and if you now look at the results, what can we see?
Well, the first thing you probably notice is that for all of the languages there is a wide range of motivations. There is not a single Wikipedia edition where you can say people go to this edition for just one reason. The second thing you can notice is that for almost all languages, that is, all but three, the most often mentioned reason to go to Wikipedia is intrinsic learning, and that is in general more common for the Eastern European and Central Asian languages. For the three languages where intrinsic learning is not the top motivation, English, Dutch, and Japanese, looking up something that has been mentioned in the media is the top answer.

What you can also observe are partially strong differences between the single Wikipedia editions. Just to show you some examples: if you look at the English Wikipedia edition and the answer for how often Wikipedia is used primarily for work- or school-related tasks, the purple bar here, you can see that for English Wikipedia that happens in only around 10% of the cases, while for Spanish Wikipedia it is the case in almost 30% of the cases. And to give you another example, if you look at how often people are motivated by just being bored or randomly browsing Wikipedia, then for the Hindi or Romanian Wikipedia editions that accounts for only about 10% of all the survey answers, while for the English or Japanese editions it is about 20%.

A final note, maybe, on this part of the results: you can also see that the answer option "other" was mentioned in at most 10% of the cases. That gives us some confidence that the taxonomy we came up with in the predecessor study is quite robust, and also holds, at least to a certain degree, for this 2017 study. Okay, so what does it look like for the other answer options?
The second question was about the information need of the different users. What you can see here is that overall the three possible information needs, the three answer options fact checking, overview, and in-depth reading, are on average relatively equally common, but there are really strong differences between the separate languages. As a tendency, you can see that for the Western and East-Central European languages, like German, English, Spanish, Hungarian, or Dutch, in-depth reading is less common, and for these languages fact checking is instead more prevalent. One specific outlier, as you can see by the one high red bar, is Hindi. For Hindi we observe that almost two-thirds of all people answer that the information need they have is in-depth reading. We did several checks; we also checked the translation of the survey question once again to see if there was some deviation that could explain this outlier. So far, this is really what we see in the data, and we don't think there is any kind of technical error on our side. We will also come to possible explanations later on, when we look at the correlations with the country-level statistics.

Okay, so let's come to the third survey question, which was about prior knowledge. What we see here as a trend is similar to what we observed for the English edition in the last survey: overall, roughly the same number of people feel familiar with the content they look up on Wikipedia as feel unfamiliar with it. As a tendency, again, readers of Eastern European languages like Hungarian, Romanian, Russian, and Ukrainian, but also Dutch, feel more familiar with the content, while readers of Asian languages, with the exception of Japanese, report being more unfamiliar. One possible reason for this could be that reporting that you are
so far unfamiliar with a specific topic could be related to the social desirability of humility in these Asian cultures, but that is, to a certain degree, speculation. Okay, so much for the direct survey answers.

Let's come to the third part of our results: here we correlated the survey results with the web request logs. How did we do that? First of all, from the predecessor study we found several patterns, in a binary, true/false form, for English Wikipedia that are correlated with certain kinds of behavior. For example, one possible pattern that we can find in the web request logs would be that someone is browsing at night, or that someone has a session length greater than three, meaning that they looked at more than three articles during their Wikipedia viewing session, and so on. Now what we can do is look at specific pairs of behavioral patterns, as they manifest themselves in the web request logs, and specific survey answers. For each of these pairs we can create a plot like the one you see here, and within this plot each language is one single dot. Where is this dot located in the plot? Well, on the x-axis we plot the probability of the behavioral pattern given a certain survey answer. So, for example, what is the probability of internal referrers?
That is, of browsing Wikipedia via its internal links: the probability of having this behavioral pattern, here an internal referrer, given a specific survey answer, here being bored and randomly exploring. The consequence is that if the specific survey answer had no influence on, or correlation with, the behavioral pattern, then these dots should all lie directly on the red line, which is the diagonal. On the other hand, if we see that a point is significantly above the red diagonal, then we can say that in this specific language edition there is some correlation between this behavioral pattern and the underlying survey answer.

Okay. So we can create one of these plots for all of these pairs, and we actually did this for all 247 pairs of patterns and survey answers that we had. Then the question is how to find something interesting, and for that we sort everything by effect. What we describe as the effect here is basically the average upward deviation of the individual points from the red line; you can also normalize this, which gives you something like a normalized mean effect. If you sort by this, we find a long list of different associations between behavioral patterns and survey responses that hold on average. Just to show you a couple of them that are particularly significant: what we can see, for example, holding across all language editions on average, is that users who give as their motivation being bored or just randomly browsing Wikipedia can be associated with long sessions.
So they browse many different articles; they basically use internal navigation, so not so much coming from Google, looking at one page, and leaving again, but using the link structure of Wikipedia. They also, more often than regular users, surf Wikipedia at night, and they have short times between requests. This might be a hint that they don't really read the articles in detail, but skip from one article to the next, going down the rabbit hole. Another thing we can associate: when people are motivated by work or school, we can associate this with desktop usage, so they are not using a mobile platform; people then have long intervals between the different requests to Wikipedia articles; and these kinds of tasks occur, more often than others, in the afternoon. And to give you a third example, when people are motivated by conversations, so something came up in a conversation, we can correlate this kind of survey answer with less internal navigation, so this is more like a quick fact check; it happens more often from a mobile platform; and we also see just short dwell times on the individual articles.

Okay, so these are effects that hold on average, and most of these common effects are also relatively stable across the different languages. For example, when you look at plot A in the upper left corner, you can see that we have the effect in the same direction for basically all of the languages. That, however, does not hold for most of the associations we found with work/school-related behavior. What you can see on the second plot, plot B, is that you have a wide spread of different effects.
So the distances to the red diagonal vary a lot across the different languages, and that gives us a hint that work/school-related browsing of Wikipedia is in fact hugely different across the different language editions. What we can also observe is that, in general, the spread across the language editions is stronger than the effect of the motivation: the distance here from left to right is much larger than the distance from the individual language data points to the red line. And finally, when we compare these results with the results from the predecessor study, we find that not all of the patterns we could see for English Wikipedia hold across all languages. For example, for English Wikipedia, in the last study and also in this one, we could observe that if people mention being motivated by a current event, they are more likely to look at a longer article. As I said, this also holds here for English Wikipedia, but as you can see in all the other plots, it does not hold in general for all Wikipedia editions.

Okay, so let's come to the last and fourth part of the results: here we correlated survey responses with country statistics. How did we do this?
We had the survey answers, and for each survey answer we knew from which country the respective survey request came. What we did next is split the data by language and by country, so we get one data point for each pair of a country and a language. If you filter this for country-language pairs that have at least 500 responses, we end up with 43 different country-language pairs. For each of these pairs, we can on the one hand compute the share of a specific survey answer, for example how often people gave us the answer that they are motivated by intrinsic learning. On the other hand, since this is now on a country level, we can get data on the specific country, for example on its socioeconomic status. The one main statistic we focused on is the so-called Human Development Index, HDI, which is a geometric mean of life expectancy, education, and an indicator of people's income.

What we found then was, for me, one of the most interesting findings of this study: we can really find clear correlations between certain survey answers, that is, certain motivations and intents of Wikipedia readers, and the socioeconomic status of the country they are in. In particular, what we find, for example in the row with media and work/school, is that the more highly developed, the more industrialized, the country is, the more likely it is that people are motivated by media, and the less likely it is that they are motivated by tasks related to work or school. An even clearer correlation appears when we go to the information need question. What we find here, with a correlation coefficient of 0.66, is that people use Wikipedia more often for just fact-checking when they are in a more industrialized country, compared to in-depth reading. And if you look at in-depth reading, for example, then
we can see that there is a strong negative correlation: the more highly industrialized the country is, the less common in-depth reading really is. This gives us a hint that the use cases of Wikipedia are not homogeneous across all countries on the whole earth; Wikipedia is used differently across the world.

Okay, so on this finding we also tried to zoom in a little bit more, and for this we specifically looked at Spanish Wikipedia. Why did we choose Spanish Wikipedia? The reason is that Spanish Wikipedia is read in a wide range of different, and differently developed, countries. We have the highly developed, industrialized countries in Europe, and also Spanish-speaking minorities in France or Great Britain, for example; Spanish is, on the other hand, also spoken across all of Latin America, with different levels of development, and for example in the Philippines. So for the Spanish Wikipedia edition we see the complete range of different countries with different use cases.

What can we do with this kind of information? For the Spanish Wikipedia, as for the other Wikipedias, we used LDA, latent Dirichlet allocation, to find topics in the Wikipedia, and then we correlated the specific topics, which we manually labeled, with the HDI, the Human Development Index, of the respective country. What we find is something that fits well with the result we had previously: for topics like math, physics, and technology, research and education, or medicine and biology, we see a really strong negative correlation coefficient, below -0.7, with the HDI.
So basically, the more highly developed the country, the less these scientific and academic topics are viewed on Wikipedia. If, on the other hand, you look at more leisure- and entertainment-oriented topics — like media, culture, or sports and teams — then we see a clear positive correlation with the HDI: these kinds of topics are viewed more often in the more highly developed countries.

Okay, let me summarize what we did in this study and what we found. We surveyed Wikipedia users about their intents in 14 different language editions and received more than 215,000 survey responses. As a second information source we had access to the webrequest logs, and by combining these two data sources we could achieve two things. First, we could use the webrequest logs to debias the potential biases in the survey responses. Second, we could find associations between specific use cases and user behavior as it manifests in the webrequest logs. Based on this data we arrived at several findings. For example, we found that reading behavior across Wikipedia language editions is motivated by a wide range of different intents in all of the editions, but also that the prevalence of these intents varies strongly across editions — and English is not in all cases representative of the other languages; in many cases
it's even more of an outlier. A second finding is that some patterns hold across languages — patterns that associate specific user intents with behaviors. On the other hand, not all of the patterns we found for the English Wikipedia in the last study hold across all languages, so here too we see heterogeneity between the Wikipedia editions. And finally, we found strong correlations between country-level socioeconomic indicators and specific Wikipedia use cases in those countries. Okay, those are the main points of this study.

Maybe some ideas about where we could go next. One thing we opened up with this last part was correlating with socioeconomic indicators at the country level, which from a knowledge-equity point of view seems super interesting to me — as do the sociodemographics of Wikipedia readers: first of all to quantify them, but also to try to characterize the motivations of specific user groups. What are the motivations of, say, young people or female Wikipedia readers compared to others? Another open point with respect to sociodemographics is: can we find inequalities at a finer granularity than the country level? And, associated with that: can we find out who is currently not reading Wikipedia at all, and maybe help those people find their way to Wikipedia? A completely different research question is language-switching behavior. So far, all of our studies focused on one Wikipedia edition at a time, but many people actually switch languages within or between their Wikipedia reading sessions. Investigating how often this occurs — between which pages, in which topics does it occur specifically?
And why do people switch at all? That would also be another interesting research direction for the future. Finally, one task we have not yet succeeded in is coming up with a good quantification, for a specific Wikipedia article, of the prevalent use cases for that article. One idea we are currently floating is a data challenge, letting the community help us with this methodological problem — so share your questions with us. The paper is already uploaded to arXiv; you can find it under this address. There is also ongoing documentation of our work, which is still in a growing phase. Thank you all, and I'm open for questions.

Thank you, Florian and Leila — a great presentation. Let me also remind people that this work will be presented at WSDM next year, as we mentioned. I'm going to ask Jonathan to relay questions from the various channels; I was also watching YouTube, but it doesn't look like there are questions there. So, Jonathan?

Hi. We have at least two questions for you, Florian. The first question is from Anne Hathager, who asks: were article lengths normalized by the article-length distribution for each wiki? There are more long articles about popular topics in bigger wikis than in smaller ones.

Florian, you're muted.

Thanks, Leila. The answer is no, we did not normalize for that. We used a fixed threshold for what counts as a long article — something like 4,000 characters.
I'm not sure about the exact number anymore. But we did not normalize it to make it more comparable across editions. Actually, what we are interested in does not depend on the article length itself, but on how common visiting an article of that length is when you have a certain intention, compared to when you do not. So even if longer articles are generally more common, we would still see an effect if the motivation is a specific one.

Excellent, thank you. One more question from IRC, and then I believe Dario has a question. This question comes from user Netrom, who asks: is this survey now a yearly event? If so, will the languages stay the same, or are there plans to change them? And could you discuss why it would or would not change? This question I will relay to Leila.

Yeah — it was almost going to be a yearly event, and we missed 2018. I think the choice of languages will continue to rely in part on the editor communities and their availability, willingness, and interest in this research. So at least in part, whether a language appears or not depends on whether there is a request from the community. And for any kind of language-related study that we do, we need some diversity of languages: if a community says no, we don't want to participate in a second or third round of this research, then we will need to look for other communities who may be interested. So the choice of languages may change, and it primarily relies on the choices of the editor communities moving forward. And just to give a sense of why this is the way it is:
I mean, obviously the community should be happy if a survey is going to run on Wikipedia, but we also rely heavily on them for this research. This is something we didn't mention in the presentation: for each language that participates in the survey, we need one point of contact from that community — someone who will make the case to their community that this is a good thing to do, and who will answer all the questions that come up in their language community throughout the study. They also spend hours with us on the translation of the survey into their language. As you can imagine, translation for this kind of research is very tricky; we tried to outsource it with paid translation, and it didn't really work out well. So we rely on editors who are interested in this research to come and work with us, so that we have quality research moving forward. That's another thing we need to take into account when we choose languages.

Excellent, thank you. We have a question from YouTube, from James Ellesman. James asks: are there any recommendations for editors, developers, and users of Wikimedia content, or opportunities to operationalize the results of this survey or of the reader-behavior analysis research? I've also posted it in the sidebar in case I garbled it.

Yeah, thanks. I think I can take this one. This is a good question.
So — one thing Florian mentioned... okay, let me step back. I think one thing that would be really interesting, in terms of the editor experience and the reader experience, is finding ways to bring editors' and readers' experiences closer to each other. At the moment, from the editor's perspective, the reader is represented at best by the page views they bring to a Wikipedia article, and maybe some other characteristics; for the most part, editors don't have a very deep understanding of the readers of their language and their articles. So, one thing Florian mentioned: the idea behind the data challenge was, can we have a system in place that, for every Wikipedia article in a given language, can tell the editor the predicted distribution of use cases for that article? So if you are an editor and you know that 90% of the people who are going to come and read this page are motivated by work- or school-
related topics, or are doing quick fact-checking rather than in-depth reading, that is the kind of information that may inform how you, as an editor, write the content of that article. So that's one line of work. The other, in terms of operationalizing: one high-level thing we have learned through this study is that one-size-fits-all solutions will probably not work. We may need tools and features developed centrally, almost homogeneously across all projects, for the basic needs of readers; but because the distribution of readers' use cases, and the characteristics of the readers themselves, differ so much across languages, more work is needed to understand how to do tool development beyond that. So I would say it is informing some of the discussions happening right now about whether we should go down the path of a more distributed model for readership or not. It's really outside the scope of my space to think about the actual products, but I'm hoping it can inform those kinds of discussions.

Excellent, thank you, Leila. No other questions on YouTube or IRC at the moment. Dario, do you want to ask yours?

Yeah. My question goes back to something we discussed during the early stages of the project; I think it's useful to unpack it here for the audience. Florian, you touched on what is not on Wikipedia, and what readers will not find on Wikipedia. The question of how coverage affects readership is a very critical one for this study. What we're observing here is not an abstract notion of a reader, but a reader who happens to be on Wikipedia because of what's on Wikipedia in any given language or on any specific topic.
Yes and no. If you think back, for example, to the last part I presented, with the topics for the Spanish Wikipedia: all readers of the Spanish Wikipedia have access to the same content, but what they look up on Wikipedia still differs depending on their socioeconomic context.

Correct, correct — but let me ask the question more specifically. We know that something that is by design not covered in Wikipedia we will not see, because the way this was designed is really to look at requests that result in at least some visit to Wikipedia. If a topic is completely missing, of course we're not capturing that. And if a topic in a given language is covered by another very prominent reference work — say, I'm just going to make this up, in Hungarian there's a very popular source that provides information along any of these dimensions — that is something that's going to affect the study, which is acknowledged in the study itself. My question is more about the actual topic distribution within a language. A language edition having a higher prevalence of topics in a given area — topics that might be more relevant to some motivations, say more academic topics versus more popular topics — will presumably give people different opportunities to browse and follow links to those topics, and thereby also affect what we see in their browsing patterns. So my question is: what are your thoughts on the intrinsic topic distribution within a language edition when it comes to comparing across languages — the probability people have of browsing while still being in the same information-need mode? Does that make sense?

Yeah, I think that's not so easy to answer. I mean, first of all, how do we measure it? Right now we are still struggling to find, for a given article, what the motivations for that article are, and as long as we cannot quantify that,
I think it's super difficult to measure, at an aggregated level, what type of use a Wikipedia edition is actually supporting. If you think of it as building blocks, what we have to solve first is: can we infer, from an article's content and its embedding in the Wikipedia link network, what a typical use case for that article is? Then we can extrapolate from there. But of course it's a bit of a chicken-and-egg problem, right? The motivation of Wikipedia readers is what it is because the content is there, and the content is probably also shaped in some form by these use cases.

Right — I was thinking of some normalization as a function of the topic distribution within a language, something in that direction. But it's a very good point that focusing on the article level is the right next step. Thank you.

No additional questions on IRC. In that case, I'd like to build on that.
When we talk about the Human Development Index results, I think there are two different ways of looking at them. One is that socioeconomic status is correlated with these varying motivations. The other gets back to who is reading and who is not: in some of these countries with a lower HDI, we may be seeing a very selective group of readers on Wikipedia. I was wondering if you have any insight into the degree to which you're seeing selection into reading Wikipedia versus actual SES-related variation — what really depends on socioeconomic status versus what depends on who is present on the site in these countries. You can imagine a country with a low HDI where the only people plausibly reading Wikipedia are those at the higher end of socioeconomic status, given the context of that country.

Yeah, that's right. Basically, we do not know, at that fine a granularity, who the people in a country are who really look at Wikipedia. That also goes in the direction we would like to go next: in addition to these motivations, we could ask people about their demographic background — age, gender, but maybe also education and income.
That would get us at least one step nearer to the question of whether there is a certain group within a country's population that really looks at Wikipedia content, and what their needs are. And that points back to: once we know who reads Wikipedia within a country, we also see who is not reading Wikipedia there, and maybe what is missing, either in terms of content or in terms of access.

One more piece: I think there's a really cool methodological aspect of this study, which is the debiasing of the surveys. You talked about it briefly, and I know there are more details in the paper. I was wondering if you could say a little more about the types of motivations that were more or less likely to lead to answering the survey.

Off the top of my head I don't remember all of it, but what I can tell you is a definite effect of this debiasing: the "bored / randomly browsing" motivation gets much less weight than it would have in the raw survey answers. Intuitively that makes sense, right? Someone who is just bored and randomly browsing Wikipedia is more likely to answer a survey than someone else. The way that manifests in the actual method is that someone who has many requests, and many short requests, on average gets down-weighted.

Thank you. We have one more question from IRC, hot off the presses, from Tilman Bayer. Tilman asks a follow-up to Leila's remarks on the product and interface implications of these results: does the team happen to have concrete suggestions or ideas for products or changes that could be informed by what we learned from this work?

Yeah, that's a good question, Tilman.
I tend to tread this line very carefully, because as much as we would like to be further ahead, we are really in the early stages of this work. I would say one thing: if we get to the point where we can predict the distribution of use cases at the article level, then we are close to making some recommendations or suggestions at the product level. The other suggestion I would have is more general, and maybe reinforces some of the conversations already happening in the product teams at the Wikimedia Foundation: again, it seems that centralized solutions are not going to be the only solutions for readers, and we need to think about more distributed ways for readers to customize the experience they have on Wikipedia.

Could you expand on what you mean by centralized?

Right. The question is: do we build systems that decide whether to do personalization for readers at the aggregate level — say, at the language level, or across languages? Or do we want to allow the reader to tell us what they need?
What are the specific dimensions they want to have control over? Based on those, we give them the basic tools and features to build the experience they want to have on Wikipedia. That line of discussion, and that way of thinking about product ideas, seems to be reinforced by the results of this research: because the distributions of readers' motivations, and their characteristics, seem to differ across languages, we cannot — just as an example — think about article-reading recommendation systems that work in one way, with one algorithm handling things the same way for all languages. We need to think about how to become more granular without making the system too complex in terms of product development, and design systems that allow the reader to pick and choose.

Jonathan, anything else? No additional questions on the channels at the moment? Okay, then I'll ask another quick question, about two specific use cases that are pretty interesting. We know that Wikipedia articles — long-form articles — are typically not very well suited to finding very specific bits of information. So broadly and qualitatively, we know that in many cases the specific fact or piece of information people are looking for is not readily visible in the article.

You're talking from the perspective of a highly developed country, right? You're already assuming they want to do fact-checking.

Yeah, I'm just describing something we have qualitative evidence is happening.
We know that many people give up reading articles because they cannot find what they're looking for, and that going back and forth to a search engine is a very common behavior — people refine their queries, use a search engine, come back to Wikipedia. In some cases, we know people just ask the question directly on a search engine, with an answer that might be sourced from Wikipedia, because it's easier to find the answer there directly. There's pretty extensive evidence this is the case, regardless of whether it is more prevalent in Western or Northwestern countries. So the question is, first, whether you have found anything in the data you collected that might be indicative of this behavior — basically dead ends in information search, where people looking for something specific don't end up with what they're looking for; of course, inferring whether they found what they were looking for is a tricky question. And, in general, is there something in your future directions we could look into, related to using the referrer class to better study this kind of cross-platform behavior that we know happens on our sites?

Yeah, I think the short answer is: at least with the user identification we are currently using — and that is one topic we did not talk about much — we do not have any cookie tracking or anything.
We rely on user identification based on a concatenation of the IP address and the user agent. And even with this kind of user tracking, I think the majority of users have just a single access request to a Wikipedia article — which means the majority of users come directly from some search engine, look something up on Wikipedia, and are never seen again, at least from our perspective. That, of course, makes a lot of follow-up questions super difficult, because most people have just one visit.

And if I may add to that: one thing related to your question, Dario, is that we observed that on average 20% of reader sessions involve switching from one Wikipedia language to another, and that is something interesting to look into — why do people switch, and on what topics? It may be related to whether they find the content in the language they're reading: is missing content the motivation for switching? Or are there certain types of content you understand better, or more easily, in another language? Or is the content actually missing, and that's why the reader switches? But that's only one part of what you're asking.

Thank you. No new questions have come in.

Sorry, I missed that — no other or additional questions? No? Okay. All right, so with that, if there are no other questions, we should close here. Again, a virtual round of applause for our two speakers. Thanks for joining today, Florian, and thanks, Leila, for presenting. This closes the 2018 series, so happy holidays, everybody, if you're celebrating or taking a break in the coming days. We'll be back in January, in the new year, with a brand-new series of research showcases. See you all.
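The cookie-less session identification Florian describes — grouping requests by a fingerprint built from the IP address and user agent — can be sketched as follows. This is an illustrative sketch only: the record field names, the hashing, and the optional salt are assumptions for the example, not the study's actual implementation.

```python
"""Sketch of cookie-less reader identification: group web requests by a
fingerprint derived from IP address + user agent, as described in the talk.
Field names, hashing, and the salt are illustrative assumptions."""
import hashlib
from collections import defaultdict


def fingerprint(ip: str, user_agent: str, salt: str = "") -> str:
    # Concatenate IP and user agent (plus an optional rotating salt, so
    # raw identifiers need not be stored), then hash the result.
    raw = f"{salt}|{ip}|{user_agent}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def group_requests(requests):
    """Group request records (dicts with 'ip', 'ua', 'page') into
    per-fingerprint lists of visited pages."""
    sessions = defaultdict(list)
    for req in requests:
        sessions[fingerprint(req["ip"], req["ua"])].append(req["page"])
    return dict(sessions)
```

As Florian notes, most such "users" contribute only a single request, so any session-level analysis built on this kind of grouping has to cope with a large majority of one-page sessions.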