the last, I would say, collective session of the day. So now we have one hour and five to fifteen minutes together to hear from all the authors of Wiki Workshop about the amazing work they have been doing. I'm so happy that we managed to get 13 papers accepted for this workshop. Three of them will be longer oral presentations, and nine of them will be lightning talks; one author unfortunately couldn't make it today. I want to introduce the contributed talks with huge thanks to all the authors: for their work, for coping with all the changes we have gone through due to these exceptional circumstances, with the amount of emails I've sent you and all the dry runs we did together. Another huge thanks, and a big heart, to the PC members and the reviewers, who have done fantastic work to select very high-quality content for this workshop. With these huge thanks, I think we can start with the contributed talks. Vladimir is going to talk about what's popular and what's trending in Wikipedia: capturing trends and language biases across Wikipedia editions. If you can share your screen... Perfect, thank you. Great, all right.

Hello everyone, my name is Vladimir, I'm a PhD student at EPFL, and I'll talk about trends and biases in different language editions of Wikipedia. The importance of biases in Wikipedia has already been mentioned multiple times today, so I'm extremely excited to share our recent insights on this topic with you. The main goal of this work is to analyze the differences and commonalities in readers' preferences across the English, French, and Russian Wikipedia editions. To do that, we detect trends using Wikipedia viewership statistics and analyze the collective interests of the readers in those languages. To give you a quick intuition of what we're going to do, I'll show you this teaser; after the presentation you will have a chance to interact with this visualization. Here you see Wikipedia articles represented as dots, and the connections between them represent hyperlinks between articles. These are pages related to the same topic, Miss America 2018. If you look at the timeline at the bottom of the slide, you can see that these articles are not very popular at the beginning of September, but on the 10th of September, when the show was broadcast, a lot of people start actively exploring the cluster, and once the show is finished this interest fades out. In this way we detect the events that interest Wikipedia readers the most, and we use this approach to compare trends and language biases across editions.

Now I'll tell you more about the technical side of the project. I'll start with the dataset. We analyzed three Wikipedia language editions, which is around 11 million pages connected by 700 million hyperlinks. All data is anonymized, and we don't know the location or the click paths of the users. Second, I want to introduce the methods that we used. First, we extracted sub-networks of trending Wikipedia articles. Then we extracted keywords from the Wikipedia summaries of those articles, defined nine high-level topics based on those keywords, and finally labeled all the sub-networks with those topics. Now I'll briefly cover every step in more detail, starting with the sub-network extraction. We use the fact that Wikipedia has an associated network structure.
As we saw in the visualization at the beginning of the presentation, if readers are interested in the topic Miss America 2018, they will go and visit the corresponding Wikipedia page. There they will see links to related concepts, and since they are interested in the topic, they will want to click on those links and explore them in more detail. This is one of the main concepts of Wikipedia. If we formalize this, we can represent the walk on a graph where nodes are Wikipedia pages and edges are the hyperlinks between them. Different users will have different paths, but they will still visit the same pages related to the topic. Whenever they click on a page or a link in Wikipedia, their activity is recorded and stored as viewership statistics.

To extract trending sub-networks, we use an algorithm that takes both the network and the viewership statistics and finds dynamic patterns in the network: if two pages have similar levels of activity in a given time frame, it assumes that those pages are related and reinforces the connection between them. Once the sub-network is extracted, we use the Louvain community detection algorithm to detect clusters of trending pages. When this is done and we have the extracted sub-networks, we need to define the topics of the clusters of trending Wikipedia articles. We used the summary of each page, extracted keywords, and based on those keywords we defined higher-level topics for each cluster. Then we labeled a subset of clusters, trained a classifier on this data, and labeled the rest of the clusters in a semi-supervised fashion. As a result of all these steps, we get a labeled sub-network, which looks pretty much like this.
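To make the sub-network extraction step more concrete, here is a minimal sketch of the general idea (reinforcing edges between co-active pages, then running Louvain community detection) using networkx. The correlation measure, the threshold, and the input format are illustrative assumptions, not the authors' exact algorithm, and `louvain_communities` needs networkx 2.8 or later.

```python
import networkx as nx
import numpy as np

def trending_clusters(hyperlink_graph, pageviews, threshold=0.8):
    """Sketch: reweight hyperlink edges by pageview co-activity, then cluster.

    hyperlink_graph: nx.Graph whose nodes are article titles.
    pageviews: dict mapping article title -> 1D numpy array of hourly views.
    """
    weighted = nx.Graph()
    for u, v in hyperlink_graph.edges():
        a, b = pageviews.get(u), pageviews.get(v)
        if a is None or b is None:
            continue
        # Pearson correlation of the two viewership time series, a simple
        # proxy for "similar levels of activity in the same time frame".
        sim = float(np.corrcoef(a, b)[0, 1])
        if sim >= threshold:  # keep only strongly co-active pairs
            weighted.add_edge(u, v, weight=sim)
    # Louvain community detection on the reinforced sub-network.
    return nx.community.louvain_communities(weighted, weight="weight")
```

Under these assumptions, each returned community would correspond to one candidate trending cluster, like the Miss America 2018 pages in the teaser.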
Now, finally, the most interesting part: let's compare the readers' preferences and see the results of our work. We analyzed three language editions of Wikipedia. Note that these are not tied to countries or locations, because we don't have location data. We defined the nine most popular topics that interest Wikipedia readers and based our analysis on those topics; also note that we split sports and football into separate topics to avoid imbalanced classes when training the classifier, so that's just a technicality. First we focused on global statistics and got quite interesting insights. We can see that English-speaking readers are mostly interested in sports, the French-speaking audience prefers content about movies and music, and Russian-speaking readers are mostly focused on science, politics, and religion. What is interesting is that the topic of religion is not trending among French-speaking readers during the period we studied. Now let's see which topics are common and which are different across the three languages. We notice that interest in global pop culture and in traumatic events, like 9/11 and Stan Lee's death, is common across languages; these sparked interest among the readers of all the Wikipedia language editions we studied. However, there are many events that appear only in one language. For example, local events related to natural disasters or local politics appear only in the locally spoken languages: Hurricane Michael appeared only in English, the Quebec elections only in French, and Soyuz spacecraft-related aerospace events only in Russian.

So why do the results look like this? We have a few possible explanations. First, different media coverage: according to the study cited below, the interests of 25 to 30 percent of Wikipedia readers are driven by news, and we can see a similar tendency in our study. Second, cultural differences between Western and Eastern cultures: as shown in the same study, readers from Western and Eastern cultures have different motivations when coming to Wikipedia. Most readers coming from Eastern cultures are driven by so-called intrinsic learning motivation, and this explains the dominance of the topic of science among Russian-speaking readers, which we see in our study. Finally, local events get more exposure in locally spoken languages. I would like to conclude by saying that Wikipedia can tell us more than is written on its pages; it's an interesting source of data that allows us to study cultural biases. In future work we would like to extend our study to more languages, as was suggested by one of the reviewers. We have tested Wikipedia as an additional source of information, and it turns out it simplifies the topic detection pipeline, making it possible to easily extend this study to more languages. If you're interested in the results, visit the website of the project, where you'll find the latest research and datasets from our group, as well as the interactive demo I promised at the beginning. Thank you for listening, and now I'll be happy to answer all your questions.

All right, Isaac, I am seeing a few questions. I think we have time for one or two, max, so please keep it on time; we have one or two minutes, basically. Yes, thanks. The first one is, I think, a clarification question from Lucy, so I can skip that. The next one comes from Rohan, who asks: what are the sizes of the sub-networks? Are they triangles or other structures? The sizes: well, English Wikipedia is the largest; its sub-networks are around 5,000 nodes, more or less, and for Russian and French they're about 1,000 pages. All right, I actually have a question then. You described some of the reasons why some of these topics spike in certain languages and not others. I'm curious, if you were trying to surface certain topics as trending in other languages, whether there are specific types of topics that you think would be more pertinent to surface globally? Well, actually, we're thinking about studying only the topics that are common across languages. There was a contribution to our dataset recently that allows us to study language links, that is, the links between different language versions of the same topics, and that will allow us to focus on the topics that are common across languages. We haven't studied this yet; we can only spot them manually right now, but if we extend the dataset it's certainly possible. All right, thank you. Thank you very much, Vladimir, and thank you, Isaac, for handling the queue. We're having problems letting people into the meeting, so I'm handling the situation in the meantime; apologies.

So the next long talk is from Kai, and it will be about content growth and attention contagion in information networks: addressing information poverty on Wikipedia. Kai, I know you're there, and you can share your screen. Yeah, can you hear me? Thank you so much. Yes, we can hear you. Great. So hi everyone, thanks for having me. I'm Kai Zhu from Boston University, and it's great to be here and talk about our study on content growth and attention contagion on Wikipedia.
This is joint work with my colleagues Dylan Walker and Lev Muchnik. We all know Wikipedia; it's great, and every day a huge number of people end up getting their information from Wikipedia. However, Wikipedia is not without problems. In fact, we know that it is subject to knowledge disparities across both geographic areas and knowledge domains. Graham and co-authors have a study in which they examine how geotagged Wikipedia articles are distributed throughout the world. They try to model and understand this using different explanatory variables, yet they still find that some parts of the world remain well below their expected values, and others have performed similar analyses across knowledge domains and found a similar problem. This is problematic because it can skew our collective knowledge and bias our perspective of what is important. So the question is: why does this happen, what do we know, and, more importantly, what can we do about it? For the sake of time I will skip the literature and related work and jump directly to what we do.

A lot of research has focused on Wikipedia editors: their motivations, their identity, and so on. We take a different perspective: in this study we focus on the articles instead. In a complex socio-technical system like Wikipedia, the articles are not independent; they are interconnected via hyperlinks. For a given article, we would like to know: how does attention, measured by something like page views, drive content production for that page? In turn, does more content attract more attention to the article? Does this feedback loop exist? And, moreover, does attention spill over onto other articles, since each article is connected to other articles in the Wikipedia network? The problem is that, in a natural setting, all of these forces come into play simultaneously, so causal inference in this setting is challenging.

To overcome this, we leverage a natural experiment involving content contribution shocks on Wikipedia. The natural experiment is a program the Wiki Education Foundation has been running for several years now: they work with college instructors, who create assignments for college students to create or expand a Wikipedia article as part of their class. A lot of articles have been expanded or created over the years, and the scale of the campaign is very large. This gives us a chance to study the impact of exogenous content contribution shocks on Wikipedia, and in particular we focus on contributions to underdeveloped articles. To infer the counterfactual of not receiving a content contribution, we also construct a set of control articles that did not receive contributions from students and match them to the treated articles based on article characteristics.

Okay, so what do we learn from this natural experiment? First, we find a significant lift in post-shock page views for the treated articles compared to articles that did not receive content contributions from students: about a 12% lift on average relative to their own baseline page views before the shock. And I want to emphasize that the effect is long lasting: it has not disappeared even 26 weeks after the content contribution shock.
We also find that the effect is stronger if the article was less popular to begin with, and when the article received more content during the class, in which case the lift can exceed 30 percent. And it's not only page views: in terms of editing activity, treated articles also receive more edits and more unique editors in the six-month period after the content contribution shock.

A natural question following these results is: where does this attention come from? Thanks to the release of the Wikipedia clickstream data, we can compare the traffic sources for both treated and control articles. What we found is that the increased traffic comes from both internal links and external websites. The internal traffic is explained by more incoming links: we found there are more incoming links from other articles pointing to the articles edited by the students, which brings more traffic. The external traffic is probably explained by search engine visibility.

That's the source of traffic; what about spillover attention? Since the articles are connected, this is something we're really interested in. The key finding here is that the creation of new links is very impactful. When editing an article during the shock period, students can also add hyperlinks pointing to other articles, beyond just adding textual content. Students didn't do this very often, but when they did, those new links served to open the floodgates of attention: attention can now spill over, and both model-free evidence and model estimates reveal that this has a very substantial effect, actually as large as receiving the content contribution directly.

This attention spillover phenomenon really caught our interest, so we began to wonder: can we design a policy that leverages this effect to benefit impoverished regions of the link network? Imagine two policies. In the first one, editors are encouraged to focus their effort on highly related groups of articles and, not only that, to deliberately build up the link structure around those articles while adding textual content; we term this proposed policy the attention-conditioned policy. In the second one, editors focus their attention on articles without considering relatedness or network structure; this is our baseline policy for comparison, and we call it the undirected attention policy. The question is which policy is better, and by how much. A naive estimation gives some intuition, but I'll skip it and go directly to our main result, which is an empirically informed PageRank diffusion simulation. I believe a lot of us are familiar with the PageRank algorithm; the cool thing about our simulation is that we make the diffusion follow empirical data on actual surfing behavior, again thanks to the clickstream data, where the actual surfing behavior is known from the clickstream patterns. Here is a graphical depiction of the two policies we are comparing. There are a lot of subtle details about how we select the proxy for impoverished regions, how we match sub-networks for the two policies, and how we do the perturbation, but I'll jump to the result for the sake of time: indeed, we find that the attention-conditioned policy leads to a significantly larger increase in excess attention to the impoverished region, up to two-fold on average, where excess attention is defined as the percentage difference in PageRank for articles in the region under the perturbation.
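As a rough illustration of what an "empirically informed PageRank diffusion" could look like, here is a minimal sketch using networkx, where clickstream counts weight the transition probabilities. The input format, the prior weight for new links, and the way excess attention is computed are assumptions for illustration, not the authors' exact simulation.

```python
import networkx as nx

def excess_attention(clickstream_edges, region, new_links):
    """Sketch: compare clickstream-weighted PageRank before/after adding links.

    clickstream_edges: iterable of (source, target, n_clicks) tuples.
    region: set of article titles (proxy for the impoverished region).
    new_links: list of (source, target) links added by a hypothetical policy.
    """
    G = nx.DiGraph()
    for src, dst, clicks in clickstream_edges:
        G.add_edge(src, dst, weight=clicks)  # empirical surfing behavior

    baseline = nx.pagerank(G, alpha=0.85, weight="weight")

    H = G.copy()
    for src, dst in new_links:
        # New links get a small prior weight, since no clicks exist for them yet.
        H.add_edge(src, dst, weight=1)
    perturbed = nx.pagerank(H, alpha=0.85, weight="weight")

    base = sum(baseline.get(a, 0.0) for a in region)
    pert = sum(perturbed.get(a, 0.0) for a in region)
    return 100.0 * (pert - base) / base  # percentage difference in PageRank mass
```

Running this once for links placed around related articles and once for links placed without regard to relatedness would give a toy version of the policy comparison described in the talk.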
The result holds both for communities and for cliques in the network. As you can see, the blue color represents the attention-conditioned policy: it spreads out toward the higher range of values, so it is on average higher than the undirected attention policy, which clusters in the lower range. So these are the takeaways: exogenous editorial effort does lead to a significant and long-lasting impact, attention propagates over the information network through hyperlinks, and, finally, informational inequality can be alleviated using policies that deliberately build links to leverage attention spillover. So this is our study; thanks for your attention.

Thank you so much, Kai. Jumping in: the team is helping debug some of the meeting issues in the meantime. Thanks for your talk; we have maybe 30 seconds for a quick question. Isaac, is there a question from the audience? Yep, from Neil, and this refers to some of the earlier graphs on the page view effects: why did the control articles also see increased page views after the shock? Right, part of it is seasonality. What we found is that during the semester a lot of Wikipedia articles have higher viewership, so that's exactly why we need a control group: the fluctuations on Wikipedia are difficult to control for, so it's better to compare the treated articles with the control ones. There's one more, from Benjamin, if we have a quick moment: did you look at the size of the edits that occur post-shock, as well as the number? Sorry, I didn't get it. Did you look at the size, in terms of characters added or changed, of the edits that occur post-shock? Ah, how much content was added: we looked at the number of edits but didn't really look at the size, but I could check that. Thanks. Thank you, Isaac. Is that all from the question front? Okay, thank you so much, Kai. You can stop sharing your screen, and we give the floor to Nicholas Vincent, who is going to talk about a deeper investigation of the importance of Wikipedia links to the success of search engines.

Sweet. Yeah, can everyone hear my voice and see my screen? Yes, thanks. Awesome, all right, I will go ahead and get started. Hi everyone, I'm Nick, a PhD student at Northwestern in the People, Space, and Algorithms research group, and I'm excited to present work I did with my advisor Brent Hecht looking at the importance of Wikipedia links to search engines. I'll jump right in. I want to start with the underlying motivation for the work. There is huge research interest and economic interest in intelligent technologies, and researchers, including folks here, have highlighted many benefits of intelligent technologies in terms of things like saving money, providing better services to people, and supporting public peer production. On the other hand, the research community has also identified downsides, like privacy concerns and exacerbated economic inequality. These discussions often focus on algorithms and platforms, but another critical component is data labor: the activities that people take to create the data that fuels intelligent technologies. I think this has been alluded to in several presentations today already. So why study this? Well, the economic concerns are high stakes, and
studying data labor is really about studying the sustainability of peer production, because highly valuable data labor is performed by peer production communities. At a high level, by making people more aware of the value of this so-called data labor, it becomes more possible to leverage that value. Leveraging the value could in some cases mean getting a paycheck, but not in the case of Wikipedia; for Wikipedia this might mean recognition, agency, or other forms of support.

Okay, so how exactly do Wikipedia and search engines fit into this data labor research agenda? Well, search engines are incredibly widely used and hugely influential ("to Google" is a verb), and search engines rely on Wikipedia for training data, for clicks, and for the actual results that they serve to users, and a lot of research has looked at the relationship between them. In 2017, McMahon et al. performed a browser extension experiment and found that removing Wikipedia links from Google search results hugely dropped the click-through rate, a really important metric. I'll point out that this study was actually motivated by a call from the Wikimedia Foundation, I think at the Wiki Workshop (maybe someone can correct me if I'm wrong about that), so people have been aware of this for a really long time. Then, in a follow-up study that I led, we collected a bunch of search engine results pages, or SERPs as they're called, with scraping software, and we found that Wikipedia was extremely prevalent in the SERPs we collected, with some caveats. The takeaway is that Wikipedia is one of the most important sources of results for search engines.

So we had some further questions. What about search engines other than Google? All of the research so far has focused on Google, with good reason. And what about mobile results, given that people are increasingly using mobile devices? We also ran into a technical challenge: how do we handle the fact that these SERPs, search engine results pages, are really changing all the time and look different for different search engines? They are not ten blue links anymore; there are all sorts of things in SERPs, such as the knowledge panel or knowledge box, which you'll see on the right-hand side of a lot of SERPs, the news carousel, the Twitter carousel, et cetera, and these are all really important.

All right, I'm going to dive right into the methods: what did we actually do? We needed to pick some search engines, some devices, and some queries to investigate. In this study we focused on Google, Bing, and DuckDuckGo, and we considered both desktop and mobile devices; we also considered the effect of different screen sizes, which I'll talk about towards the end. The question of what queries to make is really the most critical and challenging; it's very tough, because there's no open data source of search queries that Google publishes or anything. So our approach was to identify multiple important categories, drawing on past work. Going through them really quickly, there are three categories. There are common queries, which we took from a search engine optimization company that estimates query volume; these are things like "facebook", "youtube", "amazon". We got trending queries from Google Trends, things like "world cup" or "thank u, next". And finally, medical queries from prior research that used Bing data; these are things like "how to lose weight" or "indigestion". Okay, so for our data collection approach we used Puppeteer, which is Node.js software for running headless Chrome,
basically a web browser that you can run on a server without actually needing a screen. We forked the se-scraper library, and our version focuses on recording and analyzing link coordinates within the space of a SERP; I'll talk more about that later, and I'll post all the code links on the final slide and paste them in the chat as well if you want to check them out.

The old approach to SERP scraping is that the researcher might look at the HTML, maybe with something like this, write a bunch of CSS rules, and say: okay, I'm going to find all the elements with the class "search-results-abc123" and get a ranked list. But what do I do if my search page looks like this, as a lot of them do nowadays, or, zooming in a little bit, like this? There's tons of stuff there that isn't obviously parsable into a ranked list. So our approach is: let's just get all the links on the page, all the <a> elements, and calculate their positions, their x-y coordinates in pixels within the page, using JavaScript. Then you can take a page like this (maybe I searched for "zoom"), look at the full page, and define a full-page incidence rate. You could draw a line and ask how many times Wikipedia appears above the fold, or in the left-hand side of the page, or the right-hand side; you can basically keep drawing rectangles and define these different spatial incidence rates to calculate how often a domain, Wikipedia in this case, appears within a SERP or a collection of SERPs. We looked at the full-page incidence rate, and we also looked at things like above the fold: the area at the top of the page where you don't have to scroll, called "above the fold" in reference to newspapers.

Data validation for this data is actually pretty tough, because SERPs change all the time; you might remember that recently Google changed their whole SERP layout and then had to roll it back because of the backlash. So how do we check that our data is actually good? The basic approach, going through it really quickly, is that we take a screenshot of every SERP we collect and then visualize the analysis-ready data: from the JSON file that I'm actually doing the quantitative analysis with, I make a visualization and make sure that the screenshot matches up with the quantitative data. To give you a really brief idea of what it looks like: on the right-hand side, this messy stuff is the quantitative data, every single link on a SERP, visualized with some colors on a matplotlib grid, and on the left is the screenshot of that SERP. I did this, as the researcher, for a big sample of my data: for example, on the right-hand side there are Wikipedia links highlighted in green (they might not be super easy to see), and I make sure that those Wikipedia links actually appear on the screenshot of the SERP. In this case they do, so this data was good, not corrupt. We actually found a lot of errors this way, and I had to rerun my results quite a few times because of errors in the SERP collection.
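As a concrete illustration of the spatial incidence rates described above, here is a minimal sketch of how they could be computed once link coordinates have been extracted from each SERP. The input format and the fold and column thresholds are hypothetical assumptions, not the exact pipeline from the paper.

```python
from urllib.parse import urlparse

def incidence_rates(serps, domain="wikipedia.org", fold_y=700, mid_x=640):
    """Sketch: fraction of SERPs in which `domain` appears in a given region.

    serps: list of SERPs; each SERP is a list of dicts like
           {"href": "https://en.wikipedia.org/...", "x": 120, "y": 340}
           with pixel coordinates of every <a> element on the page.
    """
    def matches(link):
        return urlparse(link["href"]).netloc.endswith(domain)

    def rate(region):
        hits = sum(any(matches(l) and region(l) for l in serp) for serp in serps)
        return hits / len(serps)

    return {
        "full_page":      rate(lambda l: True),
        "above_the_fold": rate(lambda l: l["y"] <= fold_y),
        "left_hand":      rate(lambda l: l["x"] <= mid_x),
        "right_hand":     rate(lambda l: l["x"] > mid_x),
    }
```

Varying `fold_y` per device is one way to approximate the lower-bound, middle-ground, and upper-bound estimates mentioned below.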
All right, diving into the results. We looked at these incidence rates, how often Wikipedia shows up, for a bunch of different queries, devices, and search engines. Looking at the full-page incidence rates (the left part of this figure), we saw that Wikipedia links were present in many, 70 to 80 percent in some cases, of the common and trending SERPs, but much less often in the medical SERPs, and only DuckDuckGo is really using Wikipedia for a high percentage of the medical queries that we made. Comparing desktop to mobile, we saw generally similar results: the full-page incidence rates were quite similar for mobile devices and desktop devices. Next, looking at the above-the-fold incidence rates, the top, no-scroll-required part of the SERP, we saw that the desktop results are still very similar; in other words, when Wikipedia links appear, they often appear at the top of the page. On mobile, however, that's not true, which is not too surprising given that "above the fold" covers a much smaller amount of screen on a mobile device. I'll note that we accounted for different screen sizes here by calculating these results with different above-the-fold lines corresponding to different devices, and we found qualitatively similar results; shown here are just the middle-ground estimates from a lower-bound, middle-ground, upper-bound approach. Finally, looking at the left-hand versus right-hand dichotomy, we saw that for the common and trending queries the right-hand incidence rate was around the same as or higher than the left-hand incidence rate. This suggests that Wikipedia's prevalence in SERPs comes in part from the right-hand side of the page, from those knowledge-panel-style elements you might be familiar with; but they are not the only source, because there are also actual blue links to Wikipedia, not just knowledge panel links.

To summarize the findings I went through very quickly: using this easy-to-understand but limited measure of incidence rates, it seems that Wikipedia's importance to the success of search engines extends beyond Google and beyond desktop-formatted search results. This is a big replication and extension of prior work that has looked at this relationship. Queries and devices matter a lot: there are differences for medical queries, there's the fact that on a mobile device you won't see Wikipedia at the top of your screen but on a desktop device you probably will, and there are things like knowledge panel elements, which definitely have an effect on how users see these Wikipedia links.

A couple of discussion points. This definitely reinforces the idea that data from the public is fueling these highly profitable and influential intelligent technologies, and it raises the question: are Wikipedia editors some of the most important employees of search engines? And what would we do about this? It's complicated, as I'm sure this group is hyper-aware: you certainly cannot pay people to edit Wikipedia, and there are also complications around funding relationships. So what should search engines do? Should they more prominently credit Wikipedia, credit individual contributors, solicit contributions? I'm sure folks here have tons of ideas around this; maybe we can discuss it in the poster session. Finally, this reinforces that Wikipedia matters outside Wikipedia. On the positive side, that means any research you do to help reduce biases in Wikipedia will affect the users of search engines, which is almost everybody using the internet. On the other hand, negative things, like uncovered biases or any other problems, will also propagate through search engines. This really raises the stakes of the
research. Some big limitations: this is a small-scale audit study, we don't have Google's actual data, and it's still US- and English-only, which is a huge limitation, as people have talked about a ton; there are a lot of geographic and language differences, and if anyone wants to try extensions across these, I would be interested in hearing about it or collaborating. And the queries matter immensely. Okay, really quickly, I want to say a big thanks to my co-author, the reviewers, and the tons of open software that this relied on. Finally, I just want to highlight the Community Data Science Collective's COVID-19 Digital Observatory: this is a project to collect COVID-related data, and there is SERP data for COVID-related keywords, using a newer version of this data collection software that I developed after this study, so it might be of interest to you. And finally, here are some links in case you want to check it out or communicate with me. That is our 10 minutes and 15 seconds.

All right, thank you so much, Nick, for staying on time. Isaac, is there anything from the chat, any questions? There's nothing in the chat, but I have a quick question if we have time. Yeah, we do. The question I have for you, Nick, is around the medical queries; I'd be curious about your interpretation. That was where we saw a big difference between DuckDuckGo on one side and Bing and Google on the other in the Wikipedia incidence rate. Is that due to differences in their algorithms, or to more explicit design choices that they've made? My opinion, and I don't have any causal evidence for this in my data, is that there is something explicit going on with the design choices. The evidence for this is that Google and Bing both appear to have a special knowledge panel for medical queries. I don't know what they're doing technically to distinguish a medical query from other queries, but the knowledge panel looks different and only has links in it from a certain set of sources, like the Mayo Clinic, the WHO, maybe some government resources, things like that. So I think that's what it is. Thanks. There are some really interesting differences in the COVID data too, where Bing was serving tons of Wikipedia results and Google was serving none for a brief period, after they got a bunch of pushback over this kind of information; I don't have the answers to that, and if folks want to dive into it, I'd love to hear your thoughts.

All right, thank you so much, Nick. That is the end of our session of long talks; thank you so much to the three of you for taking the time to share your fantastic work. We are having a problem: people who left the meeting and want to come back cannot access it, and they're all in a parallel meeting. I don't know why this is happening; I am not a Zoom professional. The problem is that one of the speakers of the lightning talks is in that other room, so I can present on her behalf, but we're trying to fix the problem in the meantime, because after this we are going to split into individual meetings and should then be back in this main room for a social event, so it would be great. Yeah, I tried, Laura, thank you; I tried to invite them, and yes, they're having another Wiki Workshop without us. So basically they are trying to solve it while I go through the lightning talks, right? That is correct. Yeah, I'll go ahead; there
is no point in all of us changing the meeting link right now. Okay, we're creating a new event for everyone? Thank you, everyone, for your technical support. We talked about this, and it would basically be the same link, but maybe not, I don't know; this would probably be a little too disruptive if 97 people had to change meetings right now. And there's a chance that we may not come back to this one, so let's not risk it. Okay, so be patient. That's fine; I am going to share the PDF with all your lightning talks. Give me one minute just to message the last lightning-talk speaker that I will present on her behalf, because she cannot access the room for the moment. Just one minute and I will be sharing; Pablo, you're first, so get ready. All right, ready, so let me share this. Do you see the slides? Yes, good. Let me set it to full-screen mode; it's going very slowly, but this should work. Okay, Pablo, and all of you: there are nine presentations here and you have three minutes each, so please try to be on time. We are a bit flexible with timing, but please be on time. Pablo, the floor is yours.

Okay, thanks, Miriam, and hi everyone. My name is Pablo Beytía and I am a doctoral researcher at Humboldt University of Berlin, and the topic of my article is the geographical bias of the information covered by Wikipedia. Up to now, this bias has been studied only by analyzing the number of articles associated with places, but that approach does not consider the weight, or centrality, of the articles, so I would like to propose a different approach that is sensitive to these differences between articles. Could you change the slide, please? Does that work? It doesn't work; sorry, that's interesting, you cannot see it... oh, "resume share"; there's something very wrong happening here, I'm sorry. No worries, take your time. Oops, "sharing is paused"... okay, let me try to do this again, sorry. It went too well up to now, right? I mean, when you're handling a remote conference with more than a hundred people, something must go wrong, but luckily we survived until now. So now you can see it, right? Right, and this is your third slide, so I'm going to go back to the second; sorry, Pablo, for interrupting your beautiful presentation. Okay, thanks.

My proposal is to consider two aspects at the same time: on the one hand, the number of articles linked to places, which is the traditional indicator of geographical bias, and on the other hand, what I call the positioning of the articles, that is, the internal weight that they have within this information system. In this second factor I propose to include two components: first, the exposure of articles in multiple languages, and second, their connectivity with other articles, observing the hyperlink network between them and calculating centrality measures like PageRank. My proposal is to summarize both components of the positioning in a single indicator of the overall weight of each article in Wikipedia. I am working with biographies for my PhD thesis, so I call this indicator the biographical centrality index. Could you change the slide, please? Okay, thanks. To test the relevance of this new approach, I made an empirical study with biographies of people that are available in 25 or more languages in Wikipedia, and my question was: how geographically biased is the biographical information about these famous people? When you put all the biographies on a map, with each person's birthplace as the geographical reference, you can see a really high concentration of information.
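As a side note, the concentration measure used in this talk, a Gini coefficient over country-level coverage, optionally weighted by a per-article positioning score, could be sketched as follows; the data structures here are hypothetical placeholders for illustration.

```python
import numpy as np

def gini(values):
    """Gini coefficient of a 1D array of non-negative values."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    cum = np.cumsum(v)
    # Standard formula based on the Lorenz curve.
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def country_concentration(biographies, weighted=False):
    """biographies: list of dicts like
       {"country": "FR", "centrality": 0.8}   # centrality = positioning index
    Returns the Gini coefficient over per-country totals."""
    totals = {}
    for b in biographies:
        w = b["centrality"] if weighted else 1.0
        totals[b["country"]] = totals.get(b["country"], 0.0) + w
    return gini(list(totals.values()))
```

Calling it with `weighted=False` corresponds to the traditional count-based indicator, and with `weighted=True` to the positioning-sensitive one compared in the results below.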
Just considering the number of biographies, there are five countries, the USA, the UK, Italy, France, and Germany, that concentrate 50 percent of the information, and this generates a Gini coefficient of 0.79, which indicates very high inequality. But this is the information provided by the traditional approach. If we then weight the biographies by their positioning, summarized with the biographical centrality index, the results change significantly: now more than 62 percent of the biographical coverage is concentrated in those same five countries, and the inequality coefficient reaches 0.84. So, to conclude, one relevant finding of this paper is that the positioning seems to increase the estimation of inequality; if that is correct, previous evaluations may have underestimated the geographical bias of information on Wikipedia. Finally, I would like to emphasize that this methodology could be replicated, without big changes, for the estimation of other biases, for example the gender bias. Thank you very much.

Thank you, Pablo. Next is Jessica, if you're around. Yes, great; Jessica, the floor is yours. Thank you. Hello everyone, I'm Jessica, a PhD student in social complexity sciences at the Universidad del Desarrollo, in Chile, and I'm delighted to present our work here on collaboration patterns of performing artists. Collaborations in the performing arts make possible the creation of new artworks, and studying these collaborations contributes to understanding a behavior that may be unique to humans: the making of art. This area has received little attention because the performing arts are very difficult to track, hence we propose the use of historical records to investigate the most relevant artworks created by performing artists. We selected only ballet and opera because they have the most consistent encyclopedic records. In our project we reconstruct the network of artists who collaborated in the production of new ballets and operas, with data obtained from the Wikidata Query Service, shown here, but we also use a real-world network obtained from the historical repertoire of the Pittsburgh Ballet Theatre (the PBT), a ballet company in the United States. My presentation today focuses on a general description of the network structure of ballet and opera, yet our project also includes a comparison of collaboration patterns and community detection. The networks, displayed on the next slide (next slide, thank you), are static, with no specific time window. In our analysis, a node represents an artist and a link represents a collaboration between artists on the same artistic work. On this slide you can see the network composition by artist type, in different colors. We see that the networks reconstructed from Wikidata resemble the composition seen in the real world: for instance, the making of an opera consists mainly of collaborations between composers and librettists, while ballet requires more artist types, such as choreographers and costume designers. The network of the PBT showed a more even composition, since it represents individuals working together to set performances on a stage and not only creating new works, as in the Wikidata networks. In the table at the top of the slide you can see some of the metrics obtained from each network. The Wikidata networks, those labeled "ballet" and "opera", showed a larger number of isolated components, but this was not observed in the PBT network.
This may only be an artifact of the Wikidata coverage, with many profiles but missing collaborations. This feature of Wikidata reduced our chances of doing a more detailed network analysis, yet we believe this research is a good starting point to overcome the challenge of analyzing collaborative behaviors in the context of the performing arts. Thank you for listening, and I am happy to answer your questions.

Thank you, Jessica. All of the authors, of both the lightning talks and the longer oral presentations, will be present in the poster session after this one, so please keep your questions for later. Next up is Ziang; are you around? I'm ready, yes. Good, the floor is yours. Okay, thank you everyone. I'm Ziang from Beijing Normal University, and it is a great honor to attend this workshop and give a presentation about our paper on domain-specific automatic scholar profiling based on Wikipedia. In this study, a framework for automatic scholar profiling is constructed to help junior researchers gain a systematic understanding of the basic knowledge in a specific domain. It includes two major phases: fine-grained entity typing and keyword extraction. The next slide, please. Okay, thank you. A personal research record is considered a kind of basic information about renowned scientists, and junior scholars can use the relevant information as guidelines to conduct further studies. To extract such information, existing NER methods usually utilize structured information, like the infobox in Wikipedia, to generate labeled training data. However, this may lead to severe mislabeling problems, because once a term appears as the value of an attribute in the infobox, it will be labeled as that attribute wherever it appears in the article. As we can see from the example, the University of Paris appears as the alma mater of Marie Curie, so sentences like "she was also the first woman to become a professor at the University of Paris" will be labeled as alma mater, while it is actually her working institution. To address this problem, we propose an embedding method named TransPeak to represent entities and the relations between them as low-dimensional vectors for further typing. A series of experiments shows that typing performance is largely improved using TransPeak compared to other embedding methods. Meanwhile, the selective bibliography in Wikipedia often covers a large number of academic concepts, which are helpful for junior researchers to learn what has been achieved in a domain, so a keyword extraction method is needed. Existing supervised methods are usually based on a binary classifier, meaning candidate terms are classified into keywords or non-keywords, and this fails to convey the relative importance of concepts to junior scholars. To address this problem, a new keyword extraction method based on learning-to-rank and AdaBoost is proposed: it extracts keywords by ranking candidates according to their probability of being selected as keywords. Our experiments show that the proposed method outperforms other keyword extraction methods on two datasets. That's all for my presentation; thank you for listening.

Thank you very much, Ziang. Next up is Chien-Chun from Yahoo Research. Hello, can you hear me? Hi, good to go. Okay, hi everyone, my name is Chien-Chun from Yahoo Research. Today I'm going to introduce layered graph embedding for entity recommendation using Wikipedia in the Yahoo knowledge graph. This is joint work with Kin Sum Liu
from Stony Brook and Nicolas Torzec, also from Yahoo Research. First, let me tell you what the Yahoo knowledge graph is. When you search for something on Yahoo, you will notice there is a knowledge panel on the right-hand side of the search results page, which includes information about the searched entity; this information is all powered by the Yahoo knowledge graph. Today we focus especially on the "People also search for" part, which gives you related entity recommendations for the entity you queried. Next slide, please. Okay. Our recommendation system is based on the proposed layered graph embedding; in a nutshell, it means constructing embeddings by biased random walks on separate subgraphs treated as layers. To generate the embedding we consider three different kinds of graphs. The first is the wikilink graph, built from the hyperlinks on Wikipedia pages; we take the top 10 languages in Wikipedia to build this link graph. The second graph is similar to the first one, except that we only take links in the main text, and the number of appearances of each link is also considered. The third one is the clickstream graph, also provided by Wikipedia, which consists of the number of times each link was clicked by users. We treat each of these graphs as a layer to construct our embedding. Because the embedding is a great proxy for entity similarity, we take the k-nearest neighbors as candidates and then use cosine similarity, together with other features like page views, for ranking to get the final recommendation result. Next page. Let me show you some of our recommendation results. Our method provides generic recommendations of Wikipedia entities; with the proper help of entity types, we can do recommendations like people-to-people, people-to-company, or general entity queries. Also, since our embeddings are trained with different languages, we can provide recommendations for different languages, like a query in Chinese or a query in French, et cetera. This recommendation system is almost 100 percent powered by Wikipedia, except that in production we also include Yahoo search log co-occurrences to customize it a little, but the results are mostly on par. Lastly, the graph datasets and the training gold set will be publicly available on Yahoo Webscope soon. Thanks, everyone; if you are interested, my discussion session will be Modeling and Embedding.
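As a small aside on the retrieval-and-ranking step just described, here is a minimal sketch of candidate generation by cosine similarity over entity embeddings with a toy re-ranking signal. The embedding matrix, the entity index, and the popularity feature are placeholder assumptions, not the production system.

```python
import numpy as np

def recommend(query_entity, entity_ids, embeddings, pageviews, k=10):
    """Sketch: nearest neighbors by cosine similarity, lightly re-ranked.

    entity_ids: list of entity names, aligned with the rows of `embeddings`.
    embeddings: (n_entities, dim) numpy array from some graph embedding.
    pageviews:  dict entity -> page view count, used as an extra ranking feature.
    """
    idx = entity_ids.index(query_entity)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[idx]              # cosine similarity to the query
    candidates = np.argsort(-sims)[1:k * 5]  # skip the query itself, over-fetch
    # Toy re-ranking: combine similarity with a popularity signal.
    scored = sorted(
        ((sims[i] * np.log1p(pageviews.get(entity_ids[i], 0)), entity_ids[i])
         for i in candidates),
        reverse=True,
    )
    return [name for _, name in scored[:k]]
```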
Thank you very much. Next we have a presentation from one of the presenters who is unfortunately locked out in the other room; you will find her in the poster session, and I think Diego is around to tell you a little bit about the work. I am so sorry for Katerina; we're doing everything we can to solve this problem, but luckily we have the interactive poster session, so you can ask her anything there. Diego, are you there? Yeah, can you hear me? Yes, thank you so much for covering for her. This is a pity, as this was going to be Katerina's first conference presentation, but during the poster session you can ask her more questions. This work was Katerina's master thesis, and it is about matching Ukrainian red links with English Wikipedia articles. The idea here is, please, can you go to the next slide, Miriam, the main idea is that we want to take red links on the Ukrainian Wikipedia and try to match them to English Wikipedia; we can generalize this to other wikis, but this is a use case with the Ukrainian Wikipedia. The problem can also be generalized as taking any red link in any Wikipedia and matching it with a Wikidata item, and then using this, for example, for translation or for pointing the red link to an existing article in another wiki. Katerina's work was basically about trying different strategies for doing this kind of matching, and we are also contributing a dataset of red links in the Ukrainian Wikipedia matched to English Wikipedia; there was manual curation work there, and in the paper you'll find the link to the dataset and also the link to the code. Miriam, can you go to the next slide, please? The main strategies we tried consist of using BabelNet as a baseline, basically taking the word in BabelNet and seeing which entities it detects in other Wikipedias. We also tried a graph approach, using the links that surround the red links and comparing them with the ones in English, and we tried something simpler: working with the Levenshtein, or edit, distance. Ukrainian is written in Cyrillic, so we transliterated the titles to Latin script and then compared the edit distance between the link titles in English and in Ukrainian. We also tried multilingual word embeddings. Surprisingly, the best result we got was with the Levenshtein distance, mainly because most of the red links we found on the Ukrainian Wikipedia were proper nouns, names of people or places, and in that case the Levenshtein distance works best. But, for sure, the combination of all these models worked better. You can talk in more detail about the experiments with Katerina. You also have, on this slide, the links to get the dataset; we think one of the main contributions of this work is the dataset we are sharing, so you can apply your own techniques for aligning red links with existing articles in Wikipedia. Thank you very much, and please talk with Katerina during the poster session later.
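To make the strongest baseline concrete, here is a minimal sketch of edit-distance matching between a transliterated Ukrainian red-link title and a list of English article titles. The transliteration table is deliberately tiny and illustrative, and the candidate list is a placeholder; it is not the pipeline from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Toy transliteration table (illustrative only, far from complete).
UK_TO_LATIN = {"к": "k", "и": "y", "ї": "i", "в": "v", "а": "a", "р": "r",
               "о": "o", "н": "n", "д": "d", "л": "l", "е": "e"}

def transliterate(title: str) -> str:
    return "".join(UK_TO_LATIN.get(ch, ch) for ch in title.lower())

def best_match(red_link: str, english_titles: list[str]) -> str:
    """Return the English title closest to the transliterated red link."""
    latin = transliterate(red_link)
    return min(english_titles, key=lambda t: levenshtein(latin, t.lower()))
```

This works best exactly in the case described above: proper nouns whose spelling survives transliteration almost unchanged.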
Thank you, Diego, for covering for Katerina. There might be a chance that we can get the two students who were locked out of this call to present in the final fun session, because they deserve some visibility and a chance to present to a broader audience, but let's see. I've spoken enough today, I think. Tiziano, you're next, to talk about WikiHist.html.

Yes, okay. Hi everyone, I'm Tiziano Piccardi, a PhD student in the Data Science Lab at EPFL, and I'm going to present WikiHist.html, a dataset of the full history of English Wikipedia in HTML. There are many reasons that motivate this project, but since we have only three minutes, I want to focus on one question: can we rely on the wikitext to get the links on a Wikipedia page? This is an important question, because many research projects rely on the link network, for example to study the evolution of Wikipedia's navigability or to use network properties to train downstream models. Unfortunately, the answer to that question is no. As you know, Wikipedia is written in wikitext and converted by MediaWiki into HTML. In this example, "Niue is an island country", you can see that although only the text "island country" is a wikitext link, encoded with double square brackets, in the HTML we have two links; the additional one, marked in red, was created by a template. This is caused by the flexibility of MediaWiki, which allows templates and external modules to inject links into the HTML of the page. For this reason, to have a full picture of the Wikipedia link network, we converted the full history of English Wikipedia into HTML. To do this, we created a large parallel architecture that acted as a sort of time machine, rendering the HTML using the exact template versions available at the time the editor created each article revision. So what did we discover with the HTML dump? By comparing the two datasets, HTML and wikitext, we found, for example, that 7 percent, or 1.3 million, of the transitions contained in the public Wikipedia clickstream (what you see in the red square) would not be possible if you looked only at the link network generated from wikitext. Second, we also compared how the average number of links on a page changes over time, and we noticed substantial differences: for example, the gap between the two blue curves shows that in 2019 the average number of links on a page in the HTML version is three times higher than the number of links you could get from the wikitext. If anyone is interested, we released the full dataset publicly; it is seven terabytes, released on the Internet Archive. Thank you for your attention; if you have other questions, I can answer them in the poster session.
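To illustrate the wikitext-versus-HTML gap described above, here is a minimal sketch that counts internal links in both representations of the same sentence. The regular expressions are rough approximations for illustration, not the parser used to build WikiHist.html.

```python
import re

def wikitext_links(wikitext: str) -> set[str]:
    """Links explicitly written as [[Target]] or [[Target|label]] in wikitext."""
    return {m.group(1).strip() for m in re.finditer(r"\[\[([^\]|#]+)", wikitext)}

def html_links(html: str) -> set[str]:
    """Internal links actually present in the rendered HTML."""
    return {m.group(1) for m in re.finditer(r'href="/wiki/([^"#?]+)"', html)}

wt = "Niue is an [[island country]] in the South Pacific."
rendered = ('Niue is an <a href="/wiki/Island_country">island country</a> in the '
            '<a href="/wiki/South_Pacific">South Pacific</a>.')  # second link as if injected by a template

print(len(wikitext_links(wt)), "link(s) in wikitext")  # 1
print(len(html_links(rendered)), "link(s) in HTML")    # 2
```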
Great, thank you, Tiziano. Next up is WikiGender, a machine learning model to detect gender bias on Wikipedia. So, Sophia, right? Yes, please, the floor is yours. Hello, I'm Sophia, a master's student at EPFL, and I'm here to present our project WikiGender, a machine learning model to detect gender bias in Wikipedia. As you may know, Wikipedia is visited by more than nine billion people per month, and its articles are edited by volunteers around the world, which means that sometimes subjectivity and bias are introduced. As many people during this conference have been interested in bias: in our case, the aim of the project is to find out whether there is a difference in the way men and women are described on Wikipedia, and to identify which words are creating the difference. The bias is explored in two ways: first, we analyze which topics are more likely to appear depending on gender, and then we also study the bias in terms of the subjectivity introduced through the usage of adjectives. Can you please move the slide? Okay.

The dataset we use is the overviews of the Wikipedia biography articles. There are almost 1.5 million biographies, but only about 17 percent of them are about women. From the overviews we built the vocabulary from the most common words, excluding stop words and using a gender-neutral version of the words that carry gender. That way, each biography is represented with a bag of words: a binary vector which indicates the presence or absence of the vocabulary words in the text. Then we balance the dataset: there is a wide variety of occupations, and many of the words in the overviews are related to them, so to cope with this we balance the dataset by occupation, keeping the same number of entries for each gender. The model takes as input the content of the overview and uses logistic regression to assign the probability that it belongs to a woman; an accuracy score higher than 50 percent on the gender prediction task reveals the presence of bias. Moreover, using this simple model allows us to detect which features, in our case words, are most associated with each gender. Here are the results: the model achieves an accuracy of 54.6 percent using only adjectives. To get a better understanding, we extracted the most predictive adjectives for each gender; using a subjectivity lexicon, we discovered that women tend to be described with more positive and strongly subjective adjectives, while men are described with more negative and weakly subjective adjectives. When including the nouns, we got an even higher score; again we extracted the most predictive words and used the Empath library to check the topics related to each gender. The results show that the overviews of women are related to family, whereas the ones portraying men are mostly related to business and sports. If you're interested in this project, you can join us in the poster session under the Knowledge Gaps and Content topic, and feel free to visit our site by scanning the QR code. Thank you.
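For readers who want the gist of this probe in code, here is a minimal sketch of the bag-of-words logistic regression described above, using scikit-learn. The toy data and the feature inspection are simplified assumptions, not the full WikiGender pipeline, and the occupation balancing is assumed to have happened upstream.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy, already occupation-balanced sample: (overview text, 1 = woman, 0 = man).
overviews = ["renowned physicist and professor", "award-winning actress and singer",
             "professional footballer and coach", "celebrated novelist and poet"]
labels = np.array([0, 1, 0, 1])

# Binary bag of words over the most common terms (stop words removed).
vectorizer = CountVectorizer(binary=True, stop_words="english", max_features=5000)
X = vectorizer.fit_transform(overviews)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Accuracy above 0.5 on held-out data would indicate gender-predictive wording;
# the largest positive/negative coefficients point to the most gendered words.
terms = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("most 'male-associated' words:", terms[order[:3]])
print("most 'female-associated' words:", terms[order[-3:]])
```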
Thank you so much, Sophia. The next presenter, Ico, is the first author of this paper and is one of the two students locked out of the meeting, so I will try to stand in for her, although that's not really possible because she is fantastic; I'll do my best, and you might get the actual presentation later, in the final session. Ico did an Outreachy internship with Guilherme, Sam, and myself last quarter, a few months ago, and the aim of the internship was essentially to productionize a machine learning model that we developed at the Wikimedia Foundation some time ago which, given a sentence in Wikipedia, can automatically detect whether the sentence needs a citation or not. She designed a system that automatically scans a large set of random articles, extracts all sentences, runs the citation-need model, and surfaces those sentences in this big chunk of articles that would need citations. This is deployed as a data dump in a publicly accessible database on Toolforge, and the dumps resulting from this Citation Detective tool have already been integrated with one of the main tools people use to fix sentences needing citations, which is Citation Hunt. Citation Hunt normally surfaces sentences tagged as "citation needed" and asks people to fix them; now it will also include sentences that were automatically tagged as needing a citation. One of the main research applications of Citation Detective is that we can finally quantify citation quality at scale, given that we can now apply this machine learning model to a large number of articles. Ico is actually doing her master thesis on this now, computing for each article a citation quality score based on how well sourced the sentences are, that is, the proportion of sentences that are sourced versus the sentences that need citations. Initial results show that, when we look at citation quality broken down by topic, among the most well-sourced articles we find articles in biography, biology, and medicine. Biography is especially interesting, because we know that the policy on biographies of living people puts special attention on citation quality for those articles. Ico will be available in her poster session, so please ask her anything; she will be happy to reply to your questions. Sorry for not being able to replace her properly.

Okay, and the last presentation of this lightning talk session will be from Mari. Mari, are you there? Yes. Hello everyone, I am Mari, a PhD student in the Aviz team at Inria, working on visualization for linked data; Jean-Daniel Fekete is the head of the team and my advisor. The problem we target here is that data producers (could you go to the next slide, please, Mari) need, in order to increase the quality and ensure the best level of completeness of their data, to diagnose when a piece of information is missing for good reasons (for example, a person has no date of death because she is still alive), when it can be fetched from an external source, or when it can be fixed (for example, there could be a bug in the mapping process and all dates of death for, let's say, French persons could have disappeared from the data). We posit that considering entities in small groups, based on the information that is missing, can help identify those reasons. Our tool, The Missing Path, performs an analysis of the data to populate a vector for each entity with Boolean values indicating whether a path exists or not; then we project the vectors onto a map. You can go to the next slide.

Let me now describe the interface from the point of view of a Wikidata contributor who wants to curate entities belonging to the class "comics". She opens the tool and sees the map in step one. As she moves her mouse, a cluster is highlighted for which many pieces of information that are important for describing comics are missing, such as language, country of origin, publisher, publication date, and genre. She decides to inspect this group in more detail in the histogram in step two. Some paths are colored in pink, indicating that their summary might be significant. One of them is rdfs:label: she notices that out of 22 labels, 21 are in French. Another is schema:description: its summary shows that out of 22 descriptions, 21 are in Dutch, and one value is repeated 20 times, a Dutch description identifying them as Spirou et Fantasio fantasy comic strips, and 20 of those entities are part of the same series. So this group of entities appears to have very similar needs: according to quality standards, labels and descriptions should be available in similar languages, not labels in French only and descriptions in Dutch only, as is the case here. From what she knows, Spirou et Fantasio comics are well known enough that it should be easy to find the language, the publisher, and the publication date; it is probable that the information will be available from the same source, at least for some of the entities, and this source might even be the URI of the series that all the entities are part of. So it really sounds like she will save time by fixing those entities all at once. I will have to skip the details about refining and checking the selection before she clicks the export button, but I will finish by saying that the interface described in this use case is the result of an iterative design process we conducted with nine data contributors, and the user study will soon be reported in a full paper. I will be happy to answer questions during the poster session, in the Modeling and Embedding chat room, or by email; our emails are on the first page of the slides, and I would like to thank the organizers and reviewers. That's all.
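As a minimal illustration of the Boolean path-vector idea behind The Missing Path, here is a sketch that builds presence/absence vectors per entity and projects them to 2D for a map-like view. The property paths, the entities, and the use of PCA as the projection are illustrative assumptions, not the tool's actual pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical property paths checked for each entity.
PATHS = ["rdfs:label@fr", "schema:description@nl", "publisher",
         "publication_date", "genre"]

# entity -> set of paths that actually exist for it (toy data).
entities = {
    "Q_comic_1": {"rdfs:label@fr", "schema:description@nl"},
    "Q_comic_2": {"rdfs:label@fr", "schema:description@nl", "genre"},
    "Q_comic_3": {"rdfs:label@fr", "publisher", "publication_date", "genre"},
}

# Boolean vector per entity: 1 if the path exists, 0 if it is missing.
names = list(entities)
matrix = np.array([[path in entities[e] for path in PATHS] for e in names],
                  dtype=float)

# Project to 2D so entities with similar "missingness profiles" land close together.
coords = PCA(n_components=2).fit_transform(matrix)
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```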
Okay, I am not busy anymore. Thank you very much, Mari, and thank you very much to all seven of you who made it to the lightning talks. That brings us to the end of the contributed talks.