No. OK, we're live. All right. Hello, everybody, and welcome to the June 2017 Wikimedia Research Showcase. My name is Jonathan Morgan. I'm a design researcher at the Wikimedia Foundation, and I'm going to be hosting this month's showcase. This month, we have two researchers presenting. First up will be Alan Lin, who's a second-year PhD student in computer science at Northwestern University. He's going to be presenting his research, Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia. We also have Markus Krötzsch, who will be joining us, hopefully. We're debugging some audio issues with Markus's computer. But if all goes well, he will be presenting second, his research on understanding Wikidata queries. So without further ado, Alan, would you like to take it away? Yeah, sure. Thank you, Jonathan. And good morning, or whatever time it is in your time zone, because I know the audience is probably coming from all over the world. I am Alan Lin. I'm from the People, Space, and Algorithms computing group at Northwestern University. This presentation is based on our CSCW '17 paper entitled Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia. We did this work with our collaborators at GroupLens Research at the University of Minnesota. So what is this article-as-concept assumption we're talking about in Wikipedia? Often, and implicitly, Wikipedia-based studies and systems assume a one-to-one mapping between concepts and Wikipedia articles. For example, if you want to know about the concept Justin Trudeau, you go to his Wikipedia article, which has the same title, and the same goes for Angela Merkel and Theresa May. While this one-to-one mapping sounds very straightforward and intuitive, it fails on many occasions. So what is the problem? We will give two examples. In the first example, imagine that you are interested in comparing different language editions of Wikipedia, which is a familiar topic that lots of Wikipedia research has investigated. Conveniently, let's focus on the most viewed Wikipedia article, which is Donald Trump. And let's say that you are comparing how English and Chinese Wikipedia describe his family information differently. So we first go to his English Wikipedia article, find the section about his family, and summarize the statistics. For example, we might count the number of words and what percentage of the article they account for. And we do the same thing on Chinese Wikipedia. Looking at these numbers, we would probably reach the conclusion that English Wikipedia and Chinese Wikipedia have an almost equivalent amount of family information about Donald Trump. Please note that this is what we do under the article-as-concept assumption, because we assume that all content about Donald Trump, including his family information, will only appear in the article on Donald Trump. However, this assumption is highly problematic in this case. If we look at the red box, we find that there is a Wikipedia template that links to a completely separate page entitled Family of Donald Trump, which sounds like exactly what we want. Even more interestingly, this link only exists in English Wikipedia; there is no such link, nor such an article, in Chinese Wikipedia. So if you follow the link and read the article, you will probably want to update your statistics and incorporate information from this page. Now we get a more realistic picture of the comparison.
The English Wikipedia actually has far more information about Trump's family than the Chinese Wikipedia. From this example, we see that if we adopt the article-as-concept assumption, we will miss content about the concept that is not on the article page. So we just demonstrated the problem when a multilingual Wikipedia study adopts the article-as-concept assumption. In the next example, we show what happens if this assumption is adopted by artificial intelligence that uses Wikipedia data. Nowadays, lots of AI systems understand world knowledge through Wikipedia datasets, and lots of them do so under the article-as-concept assumption. We will look at a famous Wikipedia-based AI algorithm, the Milne-Witten semantic relatedness algorithm. Given a pair of concepts, Milne-Witten outputs a single score for how related they are. It does so by comparing the shared incoming and outgoing links of the two concepts' Wikipedia articles. So if automobile and global warming have a large number of shared links, their semantic relatedness will be high. Again, going back to our previous example, let's compare Donald Trump's semantic relatedness with, I'll just pull something off the top of my head, let's say Vladimir Putin. Here is the link graph between the two Wikipedia articles. We can see they share some incoming and outgoing links, such as Rex Tillerson, APEC, and G20. Please note that the link graph implicitly adopts the article-as-concept assumption again, saying that inlinks and outlinks of the concept Donald Trump can only be on the article Donald Trump. However, this assumption is problematic again. On the Donald Trump page, we find a link to another Wikipedia article dedicated to the foreign policy of Donald Trump. Given that Mr. Putin and Mr. Trump are both politicians, they might share a lot of common interests on international issues. And indeed, analyzing that page adds more shared outlinks to the original link graph. For example, both of them commented heavily on the Ukraine and Crimea issues, and both are central figures in Russia-US relations. So without these newly added shared links, the semantic relatedness computed from the original link graph might be inaccurate. Given that semantic relatedness has been widely used to support different technologies, the error might be propagated and magnified. From the above examples, we should realize that the widely adopted article-as-concept assumption is problematic in many situations. Besides the article that corresponds to the concept, there are other articles that describe the concept, each of which focuses on one aspect of the concept. In the rest of this talk, we will call the article that directly corresponds to the concept the parent article, and the articles that describe aspects of the concept sub-articles. To address the problem, our work builds a system that avoids adopting the article-as-concept assumption. Given a parent article, our system tries to retrieve all of its sub-articles. Our system does so in several steps. The first step is to filter sub-article candidates. As we just mentioned, our system retrieves all the sub-articles of a parent article. A brute-force solution would be to pair the parent article with every other Wikipedia article and classify whether the pair has a sub-article relationship. This entails comparing, for example, Donald Trump with the other five-million-plus Wikipedia articles, which is obviously very inefficient. So how can we reduce the number of comparisons?
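Before turning to that filtering step, here is a minimal sketch of the link-based relatedness idea just described, assuming the Milne-Witten style formulation over sets of linked articles; the link sets, article names, and article count below are made up purely for illustration and are not from the talk or the paper.

```python
import math

def milne_witten_relatedness(links_a: set, links_b: set, n_articles: int) -> float:
    """Link-based semantic relatedness in the style of Milne and Witten (2008):
    1 minus a normalized 'distance' between the two articles' link sets.
    Assumes both link sets are non-empty; returns a value in [0, 1]."""
    shared = links_a & links_b
    if not shared:
        return 0.0
    distance = (math.log(max(len(links_a), len(links_b))) - math.log(len(shared))) / \
               (math.log(n_articles) - math.log(min(len(links_a), len(links_b))))
    return max(0.0, 1.0 - distance)

# Made-up toy link sets: the score rises once links found on the sub-article
# "Foreign policy of Donald Trump" are merged into the parent article's link set.
putin = {"Rex Tillerson", "APEC", "G20", "Ukraine", "Crimea",
         "Russia-United States relations", "Moscow"}
trump_article_only = {"Rex Tillerson", "APEC", "G20", "New York City", "The Apprentice"}
trump_with_subarticle = trump_article_only | {"Ukraine", "Crimea",
                                              "Russia-United States relations"}

N = 5_000_000  # roughly the number of English Wikipedia articles
print(milne_witten_relatedness(trump_article_only, putin, N))
print(milne_witten_relatedness(trump_with_subarticle, putin, N))  # higher score
```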
According to our definition, sub-articles usually describe aspects of the parent article and are usually spun off from a particular section. So we started by looking at the different sections of the parent article, for example the early life section and the family section of Donald Trump. In these sections, we found Wikipedia templates that link the section to other articles. These templates are essentially wikitext, used by editors to point readers to related articles. Examples of such templates include the main article template and the further information template. We use these Wikipedia templates as indicators to filter sub-article candidates. These templates are just one type of indicator. We also find that some sub-articles appear in the See also section at the end of an article, usually sub-articles that do not have a corresponding section in the parent article. Finally, we want to highlight that the sub-article candidates filtered by these indicators are still very noisy. For example, in the See also section, only one candidate, Public image of Donald Trump, is a true sub-article of Donald Trump. And we observe this noise in the template-based candidates as well; even the same template can yield both true and false sub-articles. Because of this noise, we need a classifier to correctly identify the true sub-article relationships among the candidates. To be able to carry out the classification, we first collected a ground-truth dataset. We collected three datasets that represent articles with different characteristics. The first is what we call the high-interest dataset, which contains parent articles sampled from the 1,000 most viewed Wikipedia articles. In this dataset, you will see parent articles such as Donald Trump, his wife, famous movies, and so on. This is the most important dataset since it reflects readers' interests. The random dataset contains parent articles sampled completely at random from Wikipedia. Due to the skewed distribution of article quality, this dataset largely contains articles that are both less popular and shorter. This dataset tells us how well our classification performs for general Wikipedia articles. And finally, we sampled parent articles from more important concepts in an ad hoc fashion, and this dataset serves as a feasibility test. After we sampled these parent articles for the three datasets, we randomly sampled sub-article candidates using the indicators we described in the filtering process. For each of the three datasets, we sampled from three vastly different Wikipedia language editions: English, Chinese, and Spanish. The aim is to test whether our classification performs equally well on different languages. For each language edition, we recruited native speakers to rate whether a pair of parent article and sub-article candidate is a true sub-article relationship. In total, we collected 1,200 pairs of parent articles and sub-article candidates. Step three is the classification. For the classification task, we need features. We computed eight features for each pair of articles to be classified. We had a lot of fun generating these features, but due to time limits I can only highlight a few. We organized them into three groups. The first group of features signals the relative importance of the two articles.
For example, we compute the ratio of incoming links between the parent article and the sub-article candidate. The intuition is that the parent article should be more important than its sub-articles. The second group of features is based on implicit linguistic signals. For example, one feature is computed by counting the term frequency of the parent article's title in the sub-article's summary paragraph. The intuition is that in a true sub-article relationship, the sub-article will mention its relation to the parent article up front. In our previous example, when computing this feature between Donald Trump, the parent article, and Foreign policy of Donald Trump, a sub-article candidate, we count the frequency of the parent article's title, namely Donald Trump, in the sub-article's first paragraph. Here the frequency is two. We do this for the top 25 language editions of Wikipedia and take the maximum. The third group of features is based on the explicit Wikipedia templates that we previously used as sub-article indicators. One such feature computes the ratio between the number of language editions where a main article template is used to link the pair of articles and the number of language editions where the two articles coexist. The intuition is that if more language editions label these two articles as having a sub-article relationship, it is more likely to be true. Back to our example: the main article template is used to indicate the relationship between Donald Trump and Foreign policy of Donald Trump in English and French Wikipedia, and the two articles coexist in five language editions, so this feature is computed as two out of five. Using the features we just introduced, we trained models with a variety of popular machine learning techniques, using cross-validation. This is the performance on the different datasets. For simplicity, we only present the machine learning technique with the highest accuracy; details of the results for the different techniques can be found in our paper of the same title. So let's look at the diagram for a second. The x-axis represents the different types of datasets, and the y-axis represents the classification accuracy. The blue bar is the accuracy of the random baseline, which just randomly predicts true or false sub-article relationships according to the prior distribution in the training dataset. The orange bar is the best accuracy of our model. The focus of our interpretation is the margin between the blue bar and the orange bar, which shows the improvement from using our model. On the high-interest dataset, which contains articles in high demand by readers, and the ad hoc dataset, which is sampled from more meaningful concepts, our model outperforms the baseline consistently and substantially. However, on the random dataset, which contains articles typically of lower interest and quality, our model does not improve much over the baseline. Part of the reason is that the prior distribution of labels in the random dataset is very skewed, so just by predicting all labels as false it can reach a pretty high accuracy. Since the high-interest dataset is the one that draws the most readership and attention, we further unpacked our model's performance on this dataset. On English and Chinese Wikipedia, our model achieved 90% classification accuracy. On Spanish Wikipedia, the improvement is not as significant, probably for the same reason: the high baseline accuracy.
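To make the classification setup more concrete, here is a minimal sketch of how features along these lines (an inlink ratio, the parent-title frequency in the candidate's lead paragraph, and a cross-language template ratio) could feed a cross-validated classifier. The feature functions, raw counts, and tiny dataset are illustrative assumptions, not the authors' actual pipeline or results; only scikit-learn's standard API is used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def features(pair):
    """Three of the feature ideas described in the talk, for one
    (parent, candidate) pair given as a dict of pre-gathered raw counts."""
    inlink_ratio = pair["parent_inlinks"] / max(pair["candidate_inlinks"], 1)
    # term frequency of the parent title in the candidate's first paragraph,
    # maximized over language editions (here already pre-computed per language)
    title_tf = max(pair["title_tf_by_lang"].values())
    # share of language editions that mark the pair with a main-article template
    template_ratio = pair["langs_with_main_template"] / max(pair["langs_where_both_exist"], 1)
    return [inlink_ratio, title_tf, template_ratio]

# Tiny made-up ground-truth sample: 1 = true sub-article relationship.
pairs = [
    {"parent_inlinks": 52000, "candidate_inlinks": 900,
     "title_tf_by_lang": {"en": 2, "fr": 1}, "langs_with_main_template": 2,
     "langs_where_both_exist": 5, "label": 1},
    {"parent_inlinks": 52000, "candidate_inlinks": 48000,
     "title_tf_by_lang": {"en": 0, "fr": 0}, "langs_with_main_template": 0,
     "langs_where_both_exist": 30, "label": 0},
] * 20  # repeat so cross-validation has enough rows for this toy run

X = np.array([features(p) for p in pairs])
y = np.array([p["label"] for p in pairs])
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```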
So at this point, we have a system that can relatively accurately retrieve all the sub-articles of a parent article, at least for the most popular English Wikipedia articles. Previously, we shared some anecdotal examples of where the article-as-concept assumption fails. Now, by running our system on the 1,000 most viewed English Wikipedia articles, we can evaluate the scale of the failure of this assumption. The result shows that 70% of the most viewed English Wikipedia articles have at least one sub-article, meaning that they will be affected by this assumption. And that's a huge number. If I am building a user-centric Wikipedia-based system, I should expect that my users will usually interact with concepts whose information is distributed across multiple articles, and I will probably need to address the article-as-concept failure inside my system. And finally, we argue that the article-as-concept assumption and its problems are just one instance of the audience-author mismatch problem that arises when peer-produced data are used in research or artificial intelligence. The audience-author mismatch problem originates from communication studies, and it describes the situation where human authors fail to customize their content to the needs of their audience. A simple example would be me, an international student, as an audience: I have trouble understanding the best Super Bowl ads ("Whassup?"), because my context is significantly different from that of the ads' intended audience. In the article-as-concept case, Wikipedia editors mainly write for a human audience, who can easily judge whether to click a link in the parent article to look for additional information. However, when machines as an audience try to leverage this data, they require a more standardized way of indicating which other articles belong to the same concept. Finally, I want to discuss the implications of this project in the context of Wikidata. We think the article-as-concept assumption is particularly relevant here. For example, Wikidata is now the backend for the other-languages links on Wikipedia articles. Note that there is no Chinese Wikipedia link in the language box for the English Wikipedia article Family of Donald Trump, because in Chinese Wikipedia the information about the family of Donald Trump is in the article Donald Trump itself. As such, a reader of Family of Donald Trump would miss the large family section in the Chinese Wikipedia article Donald Trump. Fortunately, Wikidata started to address this problem by introducing the "facet of" property in 2014. However, this process requires a huge amount of crowdsourced labor, and it seems the process has been very slow. For example, none of the Donald Trump sub-articles we just showed, such as Family of Donald Trump or Foreign policy of Donald Trump, is marked as a facet of Donald Trump. Here is where our system could help expedite the process: there is potential to set up our system as a recommender for Wikipedians who are working on the "facet of" property. Our work has a couple of limitations that I want to go over. First, in this work, we filtered sub-article candidates using indicators that are explicitly encoded by Wikipedia editors; future research should look into sub-article relationships that are not explicitly encoded. And secondly, although our model performed well on the most popular dataset, it struggled on the random dataset, which contains the many lower-interest articles on Wikipedia.
And future work should look into developing a model that works well on this dataset as well. And lastly, we look forward to integrating our system into other Wikipedia-based technologies, such as Milne-Witten, to check the improvement. To summarize, we identified and problematized the article-as-concept assumption and discussed its risks for Wikipedia-based studies and systems. We addressed this problem with relatively high accuracy, especially for the Wikipedia articles that attract the most attention. And we discussed the broader implications of the machine-human version of the audience-author mismatch in peer-production systems, and argued that it can describe a growing number of challenges for artificial intelligence trying to leverage semi-structured, peer-produced datasets. The system we built to match sub-articles and the training dataset we collected are released, so that researchers and system builders can more easily avoid adopting the article-as-concept assumption. To find the link, please again refer to the CSCW paper of the same title. With that, thank you for listening to my talk, and we can open up for questions. Awesome, thank you very much. So I have a couple of questions from IRC. The first question is from E. Bernhardson. He wanted to get some clarification on how your training data was labeled. Sure. So let me go back to the page about the training data. Oh, all right. I actually have some additional slides about the training data; let's look at that. Okay, so as I said, we have three datasets, and these three datasets contain parent articles with different characteristics. We have the high-interest dataset, the ad hoc dataset, and the completely random dataset. We first sample parent articles according to these three criteria. Then, using the sub-article indicators that we described, those templates and the See also section at the end of the article, we sample what we call sub-article candidates for each parent article. For the labeling process, what we now have in our hands are pairs of parent articles and sub-article candidates. Then, for each language edition, English, Chinese, and Spanish, we recruited a pair of native speakers to rate the pairs according to the protocol shown on the screen right now. We actually asked them to rate not just true or false, but on a scale from zero to three, where zero means the pair of parent article and sub-article candidate has no sub-article relationship at all, three means the pair definitely has a sub-article relationship, and one and two are sort of the middle ground. The reason we had this multi-level labeling process is that we built the different thresholds into our system as well, so that system builders who want a stricter sub-article definition can adopt only the threes, while system builders or researchers who want a more relaxed definition of the sub-article relationship can also use the ones or twos in their systems. So for the labeling process, we have these multi-level labels so that we can play around with the strictness of the definition of the sub-article relationship. Awesome, thank you. The next question is from Computer MacGyver. He asks: great work.
I'm wondering how the high-interest articles were chosen and whether they were of equal interest across the languages studied, and whether some of these articles were of less interest in a given language edition. Alan, you can just, sorry, unmute yourself. We were getting some echo from you, but no, we're not getting any echo. So I can answer the, what was the second question again? Sorry. Sorry, we're getting a little echo. Computer MacGyver asks: I'm wondering how high-interest articles were chosen and if these were of equal interest across the languages studied. The Spanish baseline seems high. But no, we can't hear you at all. Then, all right, now, go ahead. Can you hear me now? Oh, okay, sorry. I just said, I'll answer how the high-interest dataset was sampled first, and then I'll answer the question about the language differences. For the high-interest dataset, the parent articles are randomly sampled from the 1,000 most viewed Wikipedia articles. And here I have to mention the very useful tools built by the Wikimedia research teams, the API that gives the page views and the page-view rankings. So we grabbed the 1,000 most viewed Wikipedia articles and randomly sampled a subset from that. And then, just to clarify, for each parent article we gathered the sub-article candidates using the indicators we just mentioned, the templates and the See also section. So that's how the high-interest dataset was sampled. The second question is about the language differences, because what we see is that our classification performance on Spanish Wikipedia has a very high baseline. First of all, there is some overlap between the 1,000 most viewed articles of Spanish Wikipedia and English Wikipedia. For example, Donald Trump is among the most viewed articles on English, Chinese, and Spanish Wikipedia, maybe not the single most viewed everywhere, but among the top five. So there is a large overlap in the parent articles. However, the higher baseline on Spanish Wikipedia mostly comes from the different ways Spanish Wikipedia uses the templates and the See also section. It seems that different language editions have different amounts of See also sections and templates in use, and it is these different ways of using those templates that affect our label distributions. Awesome, thank you very much, Alan. The final question we can probably mostly handle offline. Malyshev on IRC asks: is this open source? Where can we see it or use the results? Alan, if you have some links for that, I'm happy to add them to the showcase page; you can just email them to me or send them in the sidebar. Yeah, sure. Thank you very much, Jonathan. All right, thank you very much, Alan. Pleasure. So next up, we have Markus Krötzsch, who is, let me make sure I get the title right, Professor of Knowledge-Based Systems at the Faculty of Computer Science at the Technical University of Dresden. And he's going to talk to us today about understanding Wikidata queries. So thank you, Markus. I hear we got your technical difficulties resolved, so I'm looking forward to the presentation. And with that, we should actually test again to make sure the audio is working. Can you unmute yourself, Markus? Right, so I hope you can see my screen, which should be shared to the Google Hangout now, and I can move into the presentation.
So thanks for the introduction. Now, this work I'm going to talk to you about is not as finished as the one we just saw. This is a progress report on an ongoing research project where we are still in the middle, I would say, of something where we want to gather more information. I would also like to have feedback from the community, if possible, on the directions we should pursue most in the next steps; that would be very helpful. What I would like to talk about is understanding Wikidata queries, which is a project we started as a research collaboration with the Wikimedia Foundation a while ago. It will hardly be necessary to introduce Wikidata, and we just heard about it a bit, but I do have some slides on it just to make sure we are all on the same page. You know, of course, that Wikidata is the official database of Wikipedia, which has been live for a few years now at wikidata.org and is widely used in many Wikimedia projects. It's quite large, and it's been quite successful in attracting editors; it's one of the most edited Wikimedia projects. It has also collected more than 26 million items by now, which is a multiple of what we find on any Wikipedia, including English Wikipedia. So it is considered, in a certain sense, a superset of the Wikipedia editions, of course containing only certain types of data. Now, what it contains is relatively rich. Again, you will hopefully have seen it and know a bit about it; just to give you a short insight here, I have collected some extracts of the page on Douglas Adams. There is a lot of terminology and language-related information on Wikidata: we have the names of things in many languages, and we have aliases. The main part is what I have briefly shown here in the middle, the statements, the actual facts, you might say, that we record there, and then there are all these interlanguage links and project links to other Wikimedia projects, to make sure that we are really connected with the rest. Now, as I said, the main part that people usually think about when they talk about Wikidata is the statements. In their simple form they are very easy to understand, I would say: you go to a particular item, for example Douglas Adams, and a property is given there, like date of birth, with a particular value, like this date here in the example. And there are of course references, because in Wikimedia we do value references and we want to have sources for our facts. Now, this is all fairly simple and straightforward, you might say, but if you look at some other parts of Wikidata, you can see that there are also more complex pieces of information. Like in this case here, again from the Douglas Adams page, we have spouse information given with the value Jane Belson, but then there is additional information on this statement in the form of a start date and an end date, the date of the marriage and the date of Adams's death in this case. And of course the references as such can also be expanded and have a lot of sub-information attached, where you can see the details about the reference. So it is fair to say that the data we are dealing with is fairly rich, even though of course it's not free text; it has a lot of constraints, and people have to encode facts in this format in some way. Now, a few years after Wikidata started, it was extended by an actual query interface, as we call these things.
So for several years after the start of the project, it was not possible to process the data that was being gathered in very complex ways. You could basically download it and have your own software analyze the data, but there was no way to directly ask, for example, what is available in this database. This is something that changed a while ago: after a lot of consultation and a lot of technical work in this direction, Wikidata received a public query interface, a kind of web service that allows users to ask complex questions over this data. Many people will be familiar with databases, and many will even be familiar with RDF or graph databases. In a way, what we do here is rather similar to a database API exposed on the web. People in traditional databases would consider it insane to open your SQL interface on the web and let anybody ask questions, because usually you would expect that this would consume too much performance and energy and administrative work. But it turns out that with certain technologies it is quite possible to offer a very stable query service in such a very open way, without any constraints on the types of queries you can ask. The queries you can ask in this case are indeed very general. The underlying data format that is used is not directly Wikidata as a website, or even its underlying JSON format, but rather a graph-based data format called RDF, the Resource Description Framework. This graph-based framework is moderately intuitive, one might say, and is very well suited for representing very heterogeneous, strongly interlinked data that does not have the fixed, rigid schema you would usually find in traditional relational database instances. The query language used on top of this is then not SQL, as you know it from relational databases, but SPARQL, the SPARQL Protocol and RDF Query Language standardized by the W3C. This is a nice technology already because it is an open standard, freely accessible to anyone who is interested, and it has a good number of open-source implementations, such as the one used here for Wikidata, which is called Blazegraph. So it's very nice and fits our ecosystem. If you do not know SPARQL, the time for this talk is certainly too short to give you a proper introduction, but I can at least give you a flavor of how SPARQL queries look. They can be very simple, and they can also be very complex and analytical, and I have here an example that is somewhere in the middle. If you want to know where I'm from, and you somehow managed to remember my name from the introduction, you know that it ends in the combination of consonants "tzsch", which usually seems rather strange to the foreign ear, and you may wonder where such people come from. The query you see here is basically the one for the coordinates of the birthplaces of German-born people whose last name ends in "tzsch". You can see that the naming of the variables is suggestive: there is a person and a place, and they are connected by something called P19. P19 is "place of birth", the property used in the data for that. You do see numeric identifiers on this level, because we are on a technical interface and we do not have a pretty UI here, even though there are prettier UIs for query answering as well.
And then the place is located in something called Q183, which is Germany, via P17, or "country", in this case. Then this person has a label, which is the name; we are only interested in the German label, and we filter for those names which end in "tzsch". And finally we ask what geographic coordinates the birthplace actually has, and we return that from the query. If you do that, you can for example create a map from the result, which gives you something like what you see here on the left, where you can at least get a pretty good hunch at what my home area is. You can also cluster and group these results, for example by German state, to see which states people with such names usually come from. So you can see that there is a fair bit of trivia to be found here, of course, but also many things which can be interesting to many people in various contexts. What you can also see from this example, of course, is that this is not particularly easy. If you really want to write SPARQL queries, you have to learn a good deal of technical specification first, see examples, modify them, and work through that. As such, it was not at all clear when this service was introduced how much it would really be used, because it might have been that there is not so much interest in querying the data. After all, you can also access the data directly through the API, you can access it from Wikipedia by using parser functions there, and of course you can just read it on the pages. But it turned out that there is indeed a lot of interest, and a lot of queries are being asked. Now, what this gives us is, in a way, a rather unique opportunity to understand user interests in Wikimedia, in a way that we currently cannot for, say, Wikipedia. Because when somebody reads a Wikipedia article, we have basically no idea what they are looking for, what they are interested in, or why they are reading this article. There might be techniques to approach that too, but it's not obvious if you are not sitting with the person in front of the screen, seeing what they are actually reading. Whereas with SPARQL you have a very detailed, very technical, and formally standardized meaning for every single query, and you can more or less say exactly what this query is asking for, and therefore you can try to understand what is important to users. What are users interested in? Are they actually interested in Wikidata, and are they interested in specific subfields of Wikidata? It is very, very inclusive: we have everything from genes to movies to fictional characters, of course a lot of fan fiction as usual, but also a lot of other information about specific specialist topics. And of course all of the communities involved in this, and in the creation of this data, are interested in understanding whether users actually care about this data or whether it just sits there and is never used or read again. Queries may offer us some approach to this. Another thing we might learn from the queries, of course, is what technical challenges are involved in offering such a service, because this is very important. We want to scale up further, in terms of users and in terms of applications based on Wikidata, so we may expect that even more queries will be asked in the future. And in order to build an infrastructure which supports this, it is important to understand how queries typically look in such a situation.
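For reference, here is a sketch of roughly the query just described, as it could be sent to the public endpoint from Python. The query text is a reconstruction from the description, using place of birth (P19), country (P17), coordinate location (P625), and Germany (Q183); it is not the speaker's original query, and the result handling is only illustrative.

```python
import requests

# Reconstruction of the example: coordinates of the birthplaces of people
# born in Germany whose German label ends in "tzsch".
QUERY = """
SELECT ?person ?personLabel ?coord WHERE {
  ?person wdt:P19 ?place .          # place of birth
  ?place  wdt:P17 wd:Q183 .         # ... whose country is Germany
  ?place  wdt:P625 ?coord .         # coordinate location of the birthplace
  ?person rdfs:label ?personLabel .
  FILTER(LANG(?personLabel) = "de" && STRENDS(?personLabel, "tzsch"))
}
LIMIT 200
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "query-sketch/0.1 (research showcase example)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"], row["coord"]["value"])
```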
And this is all relevant, because SPARQL is a very complex language. It has a large number of features, very different types of features, of different difficulty if you want to support them at large scale. And it is very important for developers to know which of the features are relevant to optimize for. So in this sense, beyond the content, we may also get technical information out of analyzing such queries. It is both of these questions, basically, that triggered us to create this research project and look into the queries. I have already talked about the benefits, which you can see here on the slides, so this should hopefully be clear. Okay, so this is the vision and the goal, the dream if you want, at the moment. Now, it took us a while to go through the negotiations with the Wikimedia Foundation to actually get access to that data, because this data is not publicly available. Which is natural and good, because it may contain potentially sensitive information; after all, it is access data for a Wikipedia, or Wikimedia, service. And it is very important that this access data is kept confidential and not leaked in any way. We want to protect the identity of our users, for example, who may only be identifiable by their IPs and may not share their identity in any other way, maybe for good reasons. So the logs we do have available now, after getting access to them, contain little sensitive information, I would say, but may still contain some. Most sensitive, of course, are IP addresses, which identify users on the internet; this is also the part of the information we are least interested in, and we are currently not even accessing it. But there are also exact URL strings, which include the query string, the URL that was called to access the server; there are timestamps of when exactly each query was sent; and there is user-agent information, the browser used and possibly the tool or program used, if it cares to specify that in a good way. And of course, if you combine this information and do statistics on top of it, you may infer information about users in indirect ways which we do not know yet. Even though one might hope that this would be a rare exception, because really what is being asked here is only public information, of course; the queries as such only concern publicly available data. It's not like Google, where you can search for your own name, for example; unless you are a person who has a Wikipedia article, you cannot find yourself in Wikidata. So it is not really possible to ask for personally relevant information such as your address or your name. But of course one could imagine cases where some personal information is in the queries. Because of this, we have very special requirements. As I already mentioned, it takes a while to get access to the data, and we cannot really share it. I can also not freely show you everything we can statistically tell about the data so far, because I first need to get legal clearance to do that. It also means that the lifetime of the data we see is limited: we cannot get more than a few months, usually, because Wikimedia deletes all the logs after that time, again for good reasons. And there are, of course, special technical requirements. We can't just download the data in a large file and process it on our own servers, or buy a server to process it faster. We have to work on the servers at Wikimedia, behind firewalls, and make sure that really none of the data can escape to other places without first being screened.
Okay, so this is the basic setup. And now the question is: what do we need to do, and what can we do? As I said, this is a progress report, so I can only tell you about the intermediate state. First of all, of course, the server logs themselves are not a set of SPARQL queries. We are interested in the queries; we want to know what people have asked, but what we get are strings. In order to interpret these strings as SPARQL, they must be parsed, because SPARQL is complex, like any programming language, or query language in this case, would be. So you cannot just do that in an ad hoc way; you use libraries. In our case we used two open-source libraries, OpenRDF and Apache Jena. We used both because we were not sure which one would be faster and which one would be more accurate, and indeed it turned out that they disagree in many places. So it is not the case that every SPARQL query is considered a SPARQL query by every library; rather, there are certain cases where the libraries don't agree, with some being more tolerant of variations from the standard and others more strict. And a library is not uniformly strict or liberal; it might be strict in some cases but more liberal in others. Therefore we currently do not know of any library that interprets the query strings exactly as the query service does. So we cannot say for sure that we really have 100% of the queries, but we do think we have access to probably 98% of the queries or more, interpreted in the same way as the query service would interpret them. Then there are some corner cases where we cannot be sure. One has to be aware that this is not simple data. These are not just nice textbook queries; there are also a lot of invalid queries around, a lot of truncated data, or simply errors in the URL. We do not only get to see queries which were answered successfully; we get to see all queries. So there are also cases where the query was simply broken and could never have been answered. There is a lot of noise. Now, the next step is that, even if you have access to the queries in a structured way, even if you really understand their contents, their shape, and the conditions in the query, and have a timestamp for all of them, you still don't have insights. And this was the part which was more surprising to us. What we actually expected before we knew anything about the data, and we couldn't even know how many queries there would be before we had signed the NDAs and so on, but before we had looked at the data, we expected that there would be large numbers of users, which was actually true. And we also thought that because of these large numbers, variations would even out, so that over weeks and months you would more or less see the same level of activity with more or less the same patterns; not because this is true for every single user and every single tool, but because we have such a large number of users that overall there is a kind of average. Now, this was quite wrong, very wrong indeed. When we looked at the data, we found that there are extreme variations in load over time, sometimes from hour to hour, sometimes from day to day, sometimes from month to month. For example, in May we have almost twice as many queries as in April, for reasons we cannot explain; there were just twice as many queries, maybe somebody became very active.
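As a side note on the parsing step just mentioned: the project used the Java libraries OpenRDF and Apache Jena, but a minimal sketch of the same idea could use Python's rdflib to separate log entries that parse as valid SPARQL from malformed ones. The example log strings below are made up.

```python
from urllib.parse import unquote_plus
from rdflib.plugins.sparql import prepareQuery

# Made-up examples of the kind of strings found in the request logs:
logged_query_strings = [
    "SELECT ?p WHERE { ?p <http://www.wikidata.org/prop/direct/P19> ?o } LIMIT 5",
    "SELECT ?p WHERE { ?p wdt:P19",            # truncated / malformed
]

valid, invalid = 0, 0
for raw in logged_query_strings:
    query_text = unquote_plus(raw)             # logs store URL-encoded strings
    try:
        prepareQuery(query_text)               # raises if it is not valid SPARQL
        valid += 1
    except Exception:
        invalid += 1

print(f"parsed: {valid}, failed: {invalid}")
```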
And clearly, with such variations in load, if you double the number of queries then suddenly you will also have different characteristics; you can no longer say that on average queries are like this or that, because the average in May is completely different from the average in April. This was a methodological challenge which we are still not fully sure how to overcome, but we have some approaches. A major task we identified in order to deal with this was to identify and interpret bot queries. In other settings where people have analyzed the usage of web services, for example Google has surely done that a lot and other major services do it too, and people also analyze Wikidata and Wikipedia access, for example in terms of the most viewed articles, in all of those cases the majority of requests is usually fairly uniform in terms of its source: they are usually humans looking at web pages or triggering requests in a search and then getting some results back, whereas tools usually operate through another API. Now, in this case, with the SPARQL query service, we have everything going through one web service and we cannot distinguish. We get queries from bots which automatically generate millions of queries per hour, we get queries from individual users who are just clicking a single link in an email, and we get everything in between. And it turns out that it just doesn't make any sense to study all of this in one go. You first need to differentiate the sources, see the different kinds of query load, and then you can try to analyze them individually. A lot of the time we have spent so far was focused on analyzing and identifying bots and separating them from the users, which we did with several mechanisms and heuristics, and often by manual inspection of the syntax, really looking at the data and at how the queries look, trying to abstract from them in several ways, and then classifying them. The result of this is still heterogeneous. I would like to show you at least some real data, even if I maybe cannot show you all of it. This is a result we have for March 2017. There are 65 million queries in total, which you can see plotted here by the hour over this month on a logarithmic scale; note that the scale increases very quickly, from 100 to 1,000 to 10,000 to 100,000 queries per hour. You can see three curves here in different colors. The yellow one is the known bots. These are bots which we identified, and in many cases we can really tell which bot it is, but in other cases we are only sure that it is a bot and can identify its behavior, even if we don't know for sure who is running it and why; we just know it is a certain pattern associated with a program. So these are the yellow ones. The red one, about 1% of the yellow one, is the users. These are what we think are real users using a browser to issue queries. They still issue 600,000-plus queries per month, so that is not a small number, and quite substantial if you compare it to other query logs you can get for such a project. It can also be seen, at least in this graphic, that there is some amount of daily variation, at least at some times during the month, so we have some hope that this is really organic traffic, that is, that there are no hidden bots in there anymore. And then there is a lot of gray area in the middle, which is shown in the blue parts.
These are unknown bots, which we don't believe include user queries, usually because of the user agents they give, but where we really have no idea what the pattern is there and how they can be grouped into programs. So this is just one big chunk of unknowns, whereas the known bots are classified into individual programs which we can identify. Okay. So how do we now analyze queries once we have split them into bins, at least in this coarse way? For user queries, we think that it is meaningful, once you are sure the queries really come from a user, to apply descriptive statistics and aggregate those statistics. So it does make sense to say things like "50% of user queries use this or that feature" or "10% of user queries are interested in celebrities", something like that. This would be a meaningful statement, but you have to be sure that it is really users, because as soon as you have a single script in there it can completely change the statistics: a single user somewhere has written a program, launched it once in a month, and it has created queries which completely move the average in another direction. So you have to be sure that you are filtering correctly. And of course, in a way, you could say it is not completely clear what the significance of the user queries is. Is it representative of the needs of our actual community, given that many people may not use SPARQL at all? Is it representative of the information needs of SPARQL users, given that we are removing the people who run scripts and programs to create queries? What is the justification for removing them, right? So somehow they should be taken into account. But the problem is that you cannot include bot queries in any kind of aggregated statistics. The reason, simply, is that bot queries are so uneven that a single bot can dominate, and indeed does dominate, the whole traffic. It turns out that about 25 million queries per month come from the Mix'n'match tool, a kind of data-integration tool by Magnus Manske. And if you take Magnus Manske's tools together, you can see that he alone, as a person, is generating a large share of the overall traffic. So it would be a bit unfair to say this is representative of the user community of Wikidata. So we have to do a different analysis there, and this probably means we have to analyze behavior bot by bot; we cannot really aggregate across the bots easily, because that would not lead to very meaningful insights. This is our methodological dilemma, I would say: the load is so non-uniform that it is difficult to make simple, clear statements about it. Okay, and now finally, before I conclude, let me at least show you one thing which, again, I think I can share without leaking any sensitive information here. This is the ranking of the most used RDF properties inside user-generated queries in April. We had about 450,000 user-generated queries in April, and I have prepared a Google sheet, which is open for you to look at if you are interested, because the list is quite long, of all RDF properties which occur in queries and how often they occur. This is the count here, and for the cases where the property corresponds to a Wikidata property, the label is also shown, so you can spot certain effects here. Looking at this is quite interesting. You can see, for example, that the qualifiers for time are used in quite significant ways.
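A minimal sketch of how such a property ranking could be produced, assuming we simply look for prefixed Wikidata property IDs in each user-attributed query string and count each property once per query; the regular expression and the example queries are illustrative assumptions, not the project's actual code.

```python
import re
from collections import Counter

# Match property IDs in common Wikidata prefixes (wdt:, p:, ps:, pq:, ...)
PROP_PATTERN = re.compile(r"\b(?:wdt|p|ps|pq|pr|wdtn)\s*:\s*(P\d+)")

def properties_used(query_text: str) -> set:
    """All distinct property IDs mentioned in one query."""
    return set(PROP_PATTERN.findall(query_text))

# Made-up stand-ins for the user-generated queries of one month:
user_queries = [
    "SELECT ?x WHERE { ?x wdt:P19 wd:Q183 . ?x p:P26 ?st . ?st pq:P580 ?start }",
    "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 . ?x wdt:P19 ?place }",
]

ranking = Counter()
for q in user_queries:
    ranking.update(properties_used(q))    # count each property once per query

for prop, count in ranking.most_common(10):
    print(prop, count)
```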
So start time and end time play a role in many queries; at least compared to other features, it is one of the highest fractions here. But you can also make other deductions, or decide not to, and try to interpret that data. For now, I will only show you the numbers and leave the interpretation open. Okay, so let me finish with the outlook. First of all, the good news: in general, the Wikidata SPARQL query service is widely used. That was not clear initially, and we can see now that it really does have major importance in data integration. It sees a lot of programmatic usage which is only enabled because of this service, and it really helps the community that it is there. We can also see, which is also good, that we have very interesting, complex, and challenging queries, which are fun to analyze in a way. And finally, also surprising and good, we have significant direct user traffic: we really get a lot of queries from browsers, not through scripts alone, which is also something that was not clear. The bad news, of course, is that in light of this complexity and heterogeneity, understanding this ecosystem is a very hard thing. There is no unified perspective on what we have here in terms of data, and even a month, which is what we are analyzing right now, might be too short to derive patterns because of these huge variations, but we only have three months in total before we have to delete them. So it is difficult to understand the complexity of this in general. This is again where I would like to ask for input. What do you want to know? What are the particular questions, maybe from the community, that you would like to have answered regarding this load, which we can then try to find methodologies for to unearth? It is not that there are only a few obvious questions; we can ask many questions, but analyzing them takes time, and we do need to focus here to get results which are useful to you as a community. Okay, with this, thank you for your attention, and let me minimize this again so I can also see the chat and we can have a bit of questions, I hope. Thank you. Awesome, thank you very much, Markus. So we are at our end time, but Brendan has generously said that he's willing to stick around for a few more minutes, so we can get some of these questions in. The first one from IRC is from Computer MacGyver. He asks: I'm curious how many queries are invalid. Do people struggle to form a valid query given the current tools/interface? A similar next step would be to see how queries evolve, i.e., small changes and improvements in a sequence of queries by a single user. Yeah, so the first question: not many queries are invalid as a share of the overall amount, even if we just consider the user queries. There are very many valid queries, at least valid as SPARQL. Of course, it is much harder to tell whether they are meaningful in terms of what the user intended to ask. We do see queries which, for example, mention RDF properties that do not exist in the data, so there is definitely a typo and we know there won't be any results. So there are malformedness issues on several levels, but overall most of the queries being asked are correct. I should also mention that the list I linked for the properties cuts off at 100, so I didn't show you any very rarely used properties, just to make sure there is no identifying information; everything there is used many times. Now, the other question was: can we see queries evolve? That is difficult.
We would have to track the behavior of users within a session. And I have to say, this is something we would like to study, but at the moment we still have to find out which of the user queries are actually created through the official interface. All we know so far is that user queries are coming from browsers, but they may come through interfaces other than the one offered by the Foundation as the main interface. There might already be GUIs available out there, and there might be other tools which simply don't identify themselves as tools but which make it possible for users to create many queries in one step and one click. So there is actually not a very clear boundary between what is a manual query and what is a tool-generated query in the modern world of the web. I mean, as soon as you click on a web page there might be dozens of queries firing off in the background, all using your user agent, and you cannot really distinguish that very easily. But once we are able to at least extract the ones related to the user interface, we could try to identify sessions. That is difficult, though, because it is really a needle in a haystack: I think these sessions where people really modify a single query are only a very small fraction of the queries we see. At least when I look at the logs I do not see obvious patterns of this kind, so it would be hard to find. Also, I should maybe add: when I do that as a power user creating queries, I usually have many tabs open and I issue queries in several tabs, all with the same IP in the same timeframe, and they have nothing to do with each other; I just try several things and then combine them. So reconstructing the query history of such an interaction is really hard. I don't have the methodology at the moment to do that. Yeah, that makes sense. The second question is also about identifying sources of traffic. Your co-presenter Alan asks: how do you distinguish bots from users? Can you share some more details? Right. So first of all, we do have code which is public; this is not subject to the NDA. We have it on GitHub and it is linked from our project page, I think; if not, I will put it there. But it might not be of great interest, of course, if you don't have access to the data to run it on. So how do we identify bots? Currently, we first use several levels of criteria and heuristics to find bots which we are sure are bots. There are several ways. First of all, the user agent might be indicative of a single tool. Some developers really set proper user agents, so the name of the tool is right there in the logged request, and then it is very easy. Other people have included a comment in the SPARQL query; we asked for this on the mailing list a while ago and some people have done it, so there is a comment which gives the tool name for the query. If any of these is present, we are done. If not, we look at the user agents and first try to separate the ones which are typical browsers, where we really can say this is a Firefox or this is a Chrome and it is unlikely to be a tool, from the others. And we further classify queries by pattern. For every query, we abstract parts of the query and forget certain details: for example, we only remember that there was an item here but not its ID, or we remember that there is a string here but not its content, and then we compare these patterns.
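A minimal sketch of that abstraction idea: replace concrete item IDs and string literals with placeholders, then count how often each abstract pattern occurs in a given hour, flagging patterns with implausibly high counts as bot-like. The regular expressions, the threshold, and the toy data are illustrative assumptions, not the project's actual heuristics.

```python
import re
from collections import Counter

def query_pattern(query_text: str) -> str:
    """Abstract a query: forget which item is asked about and what the
    string literals contain, keep only the overall shape."""
    p = re.sub(r"\bQ\d+\b", "QITEM", query_text)      # concrete items -> placeholder
    p = re.sub(r'"[^"]*"', '"STRING"', p)             # string literals -> placeholder
    p = re.sub(r"\s+", " ", p).strip()                # normalize whitespace
    return p

def flag_botlike_patterns(queries_in_one_hour, threshold=10_000):
    """Patterns issued extremely often within one hour are likely bots."""
    counts = Counter(query_pattern(q) for q in queries_in_one_hour)
    return {pat for pat, n in counts.items() if n >= threshold}

# Toy usage: one template repeated many times vs. a single hand-written query.
hour = ['SELECT ?l WHERE { wd:Q%d rdfs:label ?l FILTER(LANG(?l)="en") }' % i
        for i in range(20_000)]
hour += ['SELECT ?x WHERE { ?x wdt:P19 wd:Q64 . ?x rdfs:label "Erika Mustermann"@de }']
print(flag_botlike_patterns(hour))
```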
And so if a single pattern occurs millions of times in a short time frame, then we can be pretty sure that this is a bot, and then we manually look at how the pattern looks, how the user agent looks, and what other characteristics we can find for these queries, and then put it in a bin and have it manually classified. So we have some patterns which we know, and we use these patterns to classify. This abstraction- and pattern-based approach also works in general; we can also apply it to user queries, for example, to see if there are certain query patterns which repeat across users, but the variation is still quite big unless the query is very short. Awesome. So there are a couple of comments from IRC; we are pasting them into the chat for you, Markus. But I had one follow-up question before we end, unless there are any other questions on IRC; I don't see any. So, I like this kind of work because it seems really pragmatically focused, right? Understanding how people are using the system in order to make improvements to the system. I'm wondering if you have thoughts, based on the findings you have gathered so far, on how we could improve the interface to support particular types of query tasks that you, for instance, know are commonly performed. Yeah, that can of course easily become cyclic, because if we look at queries which are commonly performed, we typically find the ones for which there are interfaces; that's why they are common. Queries which are not so common are also not supported by interfaces, so the ones which are easy are usually the ones we see most. It doesn't necessarily mean that they are the ones which are most required or most wanted by people. But one could try to do that; at least on a feature level we definitely can. We can see SPARQL features, and we have classifications with respect to SPARQL features, so we can really say how many queries use a certain mechanism in SPARQL. And this gives an interesting ranking. For example, we were quite surprised to see that almost 30%, or in some months more than 30%, of all user queries use paths of arbitrary length, so reachability queries, which is a fairly recent and advanced feature of SPARQL but seems to be really important for users of Wikidata, which can be explained in several ways. But it is interesting if you compare the actual usage of SPARQL features with, say, their history in terms of how long they have been around and so on; this can be somewhat surprising. So you definitely get a different feeling for the weight that each particular feature has for users. And you could, of course, use that when you build an editor and say, okay, this has to be supported, whereas other things are quite exotic and are not used so often. But it is dangerous to draw such conclusions, of course, because a feature might not be used even though there is a real need for graphical user interface support for it, and when building one and ignoring these cases, you are making exactly the wrong decisions. So this is always a problem. I must say we are actually technical researchers: we are interested in database performance, and we are interested in query patterns and characteristics. We are not social scientists. So when it comes to user behavior, we do not really have the right tools in our lab to study what users want and what their demands are.
I think to enable this, what we would like to achieve technically is that we can create a dataset of queries which can be released and shared with the public, so that other researchers can also have a go at it. But we first have to see if we can somehow get past the legal obstacles there. That would be our wish; I mean, we don't want to do all the research at home. Yeah, well, I certainly hope that you have the opportunity to release some of the code and data so that people can build off of this work. That's all we have time for today. I wanted to thank both of our presenters, Alan and Markus Krötzsch. And with that, have a great morning, evening, or afternoon, everybody. Thank you very much. Good night. One more thing regarding further questions on IRC, et cetera: where is the place to answer them, or how can people possibly post them? So I copied a couple of comments from IRC, and they're in the sidebar, one from Android, one from Computer MacGyver. If you'd like to join the IRC channel, there is a web interface. Here we go. Let me see if I can send you the, here you go. So if you go to this link and create a nick for yourself, you can join the channel. And for the two usernames, I posted in the sidebar the users who asked those two questions. OK, great. OK, good. Thanks. Thank you very much, Markus. And thank you very much, Alan. And thank you very much, Brendan, especially for the last-minute tech support; that was a new and exciting challenge for us here. For me, too. OK, good. Thanks.