All right. Hello, everybody, and welcome to the July 2018 Wikimedia Research Showcase. Today we have two presentations. Our first presentation, by Lucie-Aimée Kaffee, Hady Elsahar, and Pavlos Vougiouklis, is Mind the (Language) Gap: Neural Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders. And then our second presentation, by Fabian Flöck, will be on token-level change tracking: data, tools, and insights. And with no further delay, I will hand the mic over to Lucie.

Thank you very much. I'll just share the presentation. And here we go. So what we are presenting today, together with Hady and Pavlos, is our work on neural generation of multilingual Wikipedia summaries from Wikidata for ArticlePlaceholders. This whole thing started when we looked at Wikipedia articles. There is a clear bias in Wikipedia towards the English language: English has almost 6 million articles, while other languages such as Arabic have just under 600,000 articles. And Arabic is the fifth most spoken language in the world, so there is actually a severe need for information in that language. When there are so few articles, we get this vicious cycle of lack of information: there are few articles, so the Wikipedia gathers few readers, because people often don't even know there is a Wikipedia in their language; that gives us a small pool to recruit editors from, which again means few articles. So we are stuck in that loop, basically. But while Wikipedia is very unevenly distributed, on the other hand we have Wikidata, which is a source of multilingual structured information. Wikidata is a knowledge base maintained and edited by a community of users, similar to Wikipedia, and it is already strongly integrated with Wikipedia. It has over 48 million items, and each of those items, or entities, can have labels in over 400 languages. And we could show in a previous study that a variety of languages is already well covered, while of course still having a bias towards English. So this is what an entity looks like in Wikidata, and what multilinguality in Wikidata looks like. The entities are completely language independent: each has a Q or P ID, a letter followed by some arbitrary number. Each of those unique identifiers can then be labeled with a concept name in, as I said, over 400 languages. In our example here we have English and Arabic, but it could be a lot more languages. So now the big question obviously is: how do we get this already existing multilingual data from Wikidata into Wikipedia? And this is where the ArticlePlaceholder comes in. The ArticlePlaceholder is a tool that displays Wikidata triples on Wikipedia in a tabular form. It is dynamically generated, meaning that when something changes in Wikidata, it updates on the Wikipedia ArticlePlaceholder as well. And, very importantly, it is not a stub article, meaning we don't just dump information into the Wikipedia. We want to encourage editors, new editors, and readers to actually interact with that data and do something with it; but at the same time, if the ArticlePlaceholders were disabled, all those dynamically generated content pages would be gone. Currently it is deployed on 14 Wikipedias that are all under-resourced: Gujarati, Haitian Creole, and Urdu, for example. So there is already a degree of acceptance in the community for that tool. So this is what an ArticlePlaceholder looks like.
This is an actual screenshot from the Urdu Wikipedia about the Celtic knot. You can see the triples, or statements, of the Wikidata item in those boxes. You can display images, and you can display the references from Wikidata as well. And there is the blue button that is supposed to encourage editors to actually create a new article from this. But what we can see here, for the same topic but in Esperanto this time, is that we often have missing descriptions in the language, and then we fall back on the only ones that exist, in this case English, and they might be really Wikidata-specific. That is not very engaging. So what we aim at is to use natural language generation from a structured knowledge base such as Wikidata to generate comprehensive sentences, like introductions or summaries of the topic, that are engaging for the readers and possibly the editors as well. So, as said, we aim at enriching the ArticlePlaceholder with textual summaries generated from Wikidata using natural language generation. The whole idea is that, in general, this is more pleasant for readers than just the pure tables with boxes that we have now, and that the summaries can serve as a really good starting point for editors to start a real Wikipedia article, because in the end that is obviously our goal: that all this information gets put in context. And we aim mainly at under-resourced languages, in contrast to previous work in natural language generation. Specifically, we test on Arabic and Esperanto. Why those languages? Esperanto is an artificial language that is really easy to learn for humans, and our assumption was that it would be the same for machines as well. It already has an engaged Wikipedia community, and therefore it was just a good starting point for our experiments. And as you can see in our chart, both of those languages have very few Wikipedia articles. Arabic, though, is the fifth most spoken language in the world, and content online in Arabic is scarce, meaning it is severely underserved and we really need to get more information out there. At the same time, Arabic is a morphologically and linguistically very rich language with a big vocabulary; therefore it is just more challenging to learn, for a human as well as for a machine. When we present the work on the ArticlePlaceholder, what the Wikimedia community usually asks is: why not Reasonator? For those that don't know Reasonator, it is a tool that displays Wikidata information in a more readable way. It is an external tool, though, so it lives on Toolforge at the moment. And the problem with Reasonator compared to our approach is that Reasonator can obviously only display as much information as there is in Wikidata in that language. So we can see here that we have the same problems in the Esperanto version of Celtic knot: there is no description in Esperanto, so we only fall back to the English one. And even where Reasonator does show longer sentences that actually express information, those are usually manually created templates, which means they take the effort of those small communities away from actually creating articles and towards template creation, which we wanted to avoid. And so now I'm going to hand over to Pavlos, and he'll explain a bit more about our model and how we actually generated the text. Thank you, Lucie.
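Before diving into the model, a short aside to make the Wikidata multilinguality that Lucie described concrete: the labels of any Wikidata item can be fetched over the public MediaWiki API. Here is a minimal sketch in Python; Q42 (Douglas Adams) is just a stand-in item, and any Q-ID works the same way.

```python
import requests

# Fetch the English and Arabic labels of a Wikidata item.
# Q42 (Douglas Adams) is only a stand-in; any Q-ID behaves the same.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbgetentities",
        "ids": "Q42",
        "props": "labels",
        "languages": "en|ar",
        "format": "json",
    },
)
labels = resp.json()["entities"]["Q42"]["labels"]
for lang, label in labels.items():
    print(lang, "->", label["value"])  # e.g. en -> Douglas Adams
```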
So basically we wanted to design a system that takes as input a set of triples about a particular main entity from the ArticlePlaceholder. In this example, Floridia appears in either the subject or the object position of each of the input triples that we provide to our system. We designed a neural network architecture that is capable of processing this input set of triples and generating a Wikipedia summary in either Esperanto or Arabic, one token at a time. This architecture consists of two modules: a triple encoder, which is a feed-forward architecture that encodes the input triple set, and a decoder, which is a recurrent neural network that generates the target summary. The triple encoder processes each component of those triples separately: the subject of the first triple, the predicate of the first triple, and the object of the first triple, then the subject, predicate, and object of the second triple, and so on. For each of those triples it computes a vector representation; afterwards, it computes a vector representation for the whole input set, which is given as input to the decoder, which then initializes the text generation procedure. But in order to train such a system, we needed a dataset of knowledge-base triples aligned with Wikipedia summaries. So what we did is this: we collected the first sentences of Wikipedia summaries from Arabic and Esperanto, and we tried to align them with Wikidata triples. The main principle we followed to achieve this alignment is that we took the main discussed entity of the Wikipedia article and tried to identify the Wikidata triples in which that main entity appears in either the subject or the object position. After this pre-processing, we tried to identify textual mentions of the entities of the triple set in the summary. So in this particular example, we identified the entity Kalosita with its corresponding textual mention, and for the entity Pico Garolo we identified its textual mention in the corresponding summary. Since we are working with underserved languages, we also faced the problem of lacking textual mentions for many entities, and to deal with this challenge we followed Wikimedia's language fallback chain to identify textual mentions of entities in other languages. So, for example, we can have a Wikipedia summary in Arabic, but there might be textual mentions of entities that exist only in English. The main problem when dealing with neural network architectures is that you might have textual mentions that occur very infrequently in your dataset, and it is very challenging to compute high-quality vector representations for those. So instead of just training with those infrequent textual mentions of the entities, we introduced the concept of the property placeholder. For textual mentions of entities that occur infrequently, we identify them in the text and replace them with the property of their corresponding triple. So in this example, Kalosita is replaced in the text with the property P225 and Pico Garolo is replaced with the property P171. The last sentence is the sentence that our model is actually trained on.
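To illustrate that delexicalization step, here is a minimal, self-contained sketch, assuming the alignment of surface mentions to triples has already been computed. The entity strings follow the (transcribed) example above, and the mention-to-property mapping is hypothetical.

```python
# Minimal sketch of property-placeholder substitution, assuming the
# alignment of surface mentions to triples is already known. Rare entity
# mentions are replaced by the property ID of their triple, so the model
# learns patterns like "P225 estas specio de P171" instead of rare strings.

def delexicalize(summary_tokens, mention_to_property, rare_mentions):
    """Replace rare entity mentions with their triple's property ID."""
    out = []
    for token in summary_tokens:
        if token in rare_mentions:
            out.append(mention_to_property[token])  # e.g. "P225"
        else:
            out.append(token)
    return out

# Hypothetical training sentence with two rare taxon mentions:
tokens = "Kalosita estas specio de PicoGarolo .".split()
mapping = {"Kalosita": "P225", "PicoGarolo": "P171"}  # taxon name / parent taxon
print(delexicalize(tokens, mapping, set(mapping)))
# ['P225', 'estas', 'specio', 'de', 'P171', '.']
```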
And that delexicalized sentence is also what the model actually generates at the end. What we do afterwards, once it has generated the sentence, is that whenever a property placeholder is found, we post-process the text and replace the placeholder with the textual mention of the subject or object of a triple that shares the same property. And now I'm going to pass the lead to Hady, who is going to talk about how we evaluated our approach using automatic and human evaluation.

Thank you, Pavlos. So for the evaluation of our neural network model, we did two types of evaluation. The first part is an automatic evaluation, which is a very common procedure for evaluating natural language generation tasks in the NLP community. The other part is a community study, in which we aimed to engage the readers and the editors of Wikipedia to give their opinion and evaluate our generated summaries. So here are some of the examples generated by our neural network, in both Arabic and Esperanto, for each of the entities. For "group 14", for example, in Esperanto we generated a sentence saying, roughly, that the carbon group is a group of elements in the periodic table. These are very short sentences, but they can serve as a starting point for Wikipedia articles, and we thought about how we could evaluate them with objective metrics. The first metrics we used for automatic evaluation are a set of metrics widely used in the NLP community: BLEU scores from BLEU-1 to BLEU-4, where the number represents the length of the n-gram overlap being counted. We also used METEOR and ROUGE. Again, those are widely used metrics in machine translation and also in summarization tasks. We compared our model against three different baselines. The first one is machine translation: we used machine translation from the existing English documents to Esperanto and Arabic. Why did we use this? Because machine translation already exists in Wikipedia; many people are already using machine translation to start writing new Wikipedia articles, so we wanted to see how our model compares to it. The second baseline is a statistical n-gram language model; we used a five-gram language model. The third one is template retrieval. Template retrieval methods are widely used for text generation: after automatically building a set of templates from existing Wikipedia articles, we use information retrieval methods to retrieve the best template and fill it with the appropriate information or keywords, and then an article is generated. When we look at the numbers, we find that our model actually does quite a lot better than the other baselines over all evaluation metrics, BLEU, ROUGE, and METEOR, which is a very good indication. However, n-gram overlap is only indicative of the quality of your text; it cannot capture Wikipedia-appropriateness or other aspects of the evaluation that might not be visible to automatic evaluation metrics. So what we did is run a community study engaging the Wikipedia community. We ran two 15-day online surveys: part aimed at the readers community, the other part aimed at the editors community, for both Arabic and Esperanto.
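As a side note on those metrics for readers unfamiliar with them: here is a minimal sketch of how BLEU-1 through BLEU-4 can be computed with NLTK. The sentences are toy stand-ins, not actual system output.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy example of BLEU-1..4: n-gram overlap between a candidate sentence
# and a reference (whitespace-tokenized stand-ins, not real output).
reference = ["the carbon group is a group in the periodic table".split()]
candidate = "the carbon group is a group of the periodic table".split()

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights up to n-grams
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```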
So, for the surveys, we targeted two groups, the readers and the editors. For recruitment, we reached the readers through social media, Twitter and Facebook, but we also went to the Esperanto subreddit and asked people to participate in our surveys. For the editors, we also recruited over social media, but very specifically over Wikipedia community pages and mailing lists. We were very happy to find that the community engaged a lot with our surveys and gave us very nice feedback; the participation was quite remarkable, and some people actually featured our surveys at the WikiArabia conference in Alexandria. The first part is the readers' evaluation. We asked the participants to evaluate the generated summaries according to fluency and appropriateness. Fluency is basically how understandable and grammatically correct the text is; it is a metric between zero and six, and people had to choose a corresponding score for each summary. Appropriateness is basically: do you feel this generated summary could be part of a Wikipedia article or not? Our intuition here is that some texts can be well written and informative, but might still not fit the style of Wikipedia writing. That is why we compared three different sources: our generated summaries, which are what we want to evaluate in the end; existing Wikipedia articles, which are in a sense the gold standard, text that has already been approved by the Wikipedia editors to be put in Wikipedia; and, third, news text. In terms of participation, we had quite good numbers from the Arabic and Esperanto communities: around 233 and 406 annotations for Arabic and Esperanto respectively, for fluency and appropriateness. This is the interface that was shown to the participants: we have the instructions, we have the generated summary, and then we ask them to select a score for fluency. For appropriateness, we show them the generated summary, the news summary, or the Wikipedia article, and we ask: do you think this can be a starting point for a Wikipedia article or not? Looking at the numbers, we find that our generated summaries are very close to the Wikipedia summaries in terms of fluency, which is a good indication that our generated texts are fluent, are not full of grammatical errors, and can be quite understandable. In terms of appropriateness, we were glad to find scores matching our initial intuition: the participants liked our generated summaries and found them appropriate for Wikipedia articles, unlike news text, which only 35 and 52 percent of the participants judged appropriate; even though news text is quite fluent, it is not appropriate to be put in a Wikipedia article. So this is an early indication that these kinds of neural language generation methods actually have good potential to be widely accepted by the community.
We don't only want to generate text that is pleasant for readers and helps display information on Wikipedia; we also want those summaries to be a starting point for editors. What we mean by this is that those summaries can be reused by editors as a starting point. So we wanted to evaluate: if someone writes a Wikipedia article with our summary as a starting point, how can we measure the amount of reuse of our text in the new Wikipedia article? How much of the text was reused by editors? To measure this, we used a method from the plagiarism detection research field called greedy string tiling, which quantifies the amount of reuse of your text in another document. Unlike in plagiarism, where high reuse is something negative, in our use case, when people reuse our generated summaries to start a new Wikipedia article, it is actually a positive indication of the usefulness of those summaries. We divided the results of the generated summaries, as evaluated by the editors, into three parts. The first part is wholly derived, which means the editor reused the text entirely when writing the new paragraph; partially derived means they reused parts of it; and non-derived means the editor didn't find our generated summary useful for starting the new article at all. Again, these are the participation numbers: the users wrote between 33 and 38 articles, reusing the generated summaries to start writing paragraphs for Wikipedia. This is the interface we gave them: we show the generated summary and the Wikidata triples, and in an editing field we ask them: can you write a starting paragraph for a Wikipedia article in your language on this topic? Looking at the numbers, we found that in Arabic and Esperanto between more than 40 percent and 78 percent of the articles were fully reused by the editors to start new articles; that is what we refer to as WD, the wholly derived ones. This high amount of reuse is a good indication that those summaries can be a good starting point for the editors and can potentially be used to recruit new editors. To conclude our study: neural natural language generation can actually work for under-resourced languages, even languages with quite different properties, like Arabic and Esperanto. The common understanding is that neural methods need a lot of data and might not be appropriate for under-resourced languages, but in our studies we have shown that our generated summaries can be of good quality: they have adequate fluency to be put in Wikipedia, and they are also generated in a style that is suitable for Wikipedia. We also find that generated summaries can be useful for article creation, so they can potentially be used by editors to start new articles. And from the community study we can conclude that the ArticlePlaceholder can be a use case for natural language generation tasks.
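As a side note on that reuse measurement, here is a simplified, self-contained sketch of greedy string tiling. The real algorithm marks all maximal matches of maximal length in each pass; this variant takes one tile at a time, which is enough to convey the idea. The edited sentence is a hypothetical extension of the Floridia example.

```python
def greedy_string_tiling(a_tokens, b_tokens, min_match=3):
    """Simplified greedy string tiling: returns total tiled token count.

    Repeatedly finds the longest common unmarked token run of the two
    sequences, marks it as a tile, and stops when no run >= min_match
    remains. Reuse ratio = tiled tokens / len(a_tokens).
    """
    marked_a = [False] * len(a_tokens)
    marked_b = [False] * len(b_tokens)
    tiled = 0
    while True:
        best = (0, 0, 0)  # (length, start in a, start in b)
        for i in range(len(a_tokens)):
            for j in range(len(b_tokens)):
                k = 0
                while (i + k < len(a_tokens) and j + k < len(b_tokens)
                       and not marked_a[i + k] and not marked_b[j + k]
                       and a_tokens[i + k] == b_tokens[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < min_match:
            break
        for k in range(length):
            marked_a[i + k] = True
            marked_b[j + k] = True
        tiled += length
    return tiled

generated = "Floridia estas komunumo de Italio .".split()
edited = "Floridia estas komunumo de Italio en Sicilio .".split()
reuse = greedy_string_tiling(generated, edited) / len(generated)
print(f"reuse ratio: {reuse:.2f}")  # 0.83: mostly (wholly/partially) derived
```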
So if an NLP researcher is working on natural language generation, especially using data from Wikipedia, it can be very useful to frame the study in terms of a project in Wikipedia, such as the ArticlePlaceholder, which helps a lot in engaging the community. And finally, engaging the community of readers and editors is the way to go when doing natural language generation. A common problem with natural language generation tasks is that automatic evaluation is indicative but not adequate to fully evaluate your system. What we can stress from our study is: if you engage the Wikipedia community in your task and ask for their feedback, that feedback can be very useful, and in my opinion it is the ultimate way of evaluating a natural language generation task. That's it; those are the links to the two published papers we have, and I think it's time for questions.

All right. Thank you all. We do have a few minutes for questions. Dario, do we have anything from IRC? Not for now, but I have one comment and one question that I'd like to ask the speakers. So, yeah, great work, everybody. It's exciting to see work bridging the gaps we have across Wikipedias; this is the beginning of a lot of interesting research in that area. So I was curious about two things: whether you had experimented with going beyond the lead and the summary, and whether you have a sense of whether this approach could be generalized for generating other parts of the article. As you may know, the team is working on section recommendations, and there's a complementary problem for underserved articles of how to generate text within a recommended section. So I'm curious about that, and I have a separate comment that I'll ask later.

Yeah, very nice question. So that is basically our future work; it's what we are currently investigating and want to play around with more, because the state of the art at the moment is one sentence, in English, for the biography domain. That is something we usually point out in our presentations: our work is domain independent, because we work with very little data already, so we do it domain-independently. So we are already exceeding what's out there, even for English. We want to experiment more with how we can use other tasks, such as text summarization and machine translation, to supply a whole article; basically building a tool that incorporates different areas, because pure text generation alone will not be enough. After the first sentence it will basically generate rubbish, because there is not enough training data, not enough patterns to learn from.

Right, that makes sense. Thank you. Yeah, very exciting. The second question is about the interface you used for evaluation. First off, I think we're all very interested in strategies to incentivize participation from Wikipedians in general to review machine-generated scores or summaries or classifications or whatever. So I have a general question about strategies that work in terms of recruitment, finding people who are interested in this kind of work. But more specifically, I wanted to flag that we have this interface called Wiki Labels that I'm not sure you've heard of or have seen.
It's a component of the ORES pipeline that Aaron Halfaker has designed; it takes care of the labeling part, and it pretty much works like what you showed. It's a configurable interface that allows you to take a snippet of text, or whatever the task at hand is, and generate labeled data by assigning labeling tasks via queues to registered Wikipedia editors. So I just want to flag that there might be an opportunity to reuse and extend that interface in the future. I just wanted to make sure you're aware of the project, and if not, I'll send you pointers so you can take a look. I just threw the link into the chat sidebar.

Yeah, amazing. That's actually exactly what I was looking for, for a different project as well. What we found, especially when it comes to community engagement outside Wikipedia (because we haven't yet included this in the actual ArticlePlaceholders to evaluate how people interact with it), is that working with under-resourced languages really helps. The Esperanto community is really engaged, and the same with what Hady showed earlier, the Arabic community that featured us at the WikiArabia conference. People are really helpful and really interested in bringing their Wikipedia forward, which was a really great experience, to be honest, even though what we gave them to evaluate was not even a proper interface but our not-so-pretty university survey system.

Yeah, that makes a lot of sense. Cool. Thank you. That's all from me. And I don't see any questions from IRC for now. None on IRC. I have one that I'll ask at the end, but for now, thank all three of you. I just want to say that I really thought the design of your user evaluation studies was quite clever, so I appreciate that. But without further ado, let's move on to Fabian Flöck, who is going to present on token-level change tracking: data, tools, and insights.

Okay, thank you. Thank you for having me. Let me just share my screen quickly. Okay, does that work for everyone? Yep, we can see it. Perfect. Okay, so what I'm going to present is maybe not so much one research paper on a specific topic, but something that came out of an earlier research paper a couple of years ago and has been growing into something a little bit bigger. What I'm going to talk about is how to track tokens in a revisioned document system like Wikipedia, the data that we produce through that system, some tools and services we offer, and some insights we gained along the way. I'm trying to give a broad overview of what you can do if you want to reuse that data and those tools. I'm a postdoc at the computational social science department at GESIS in Cologne, and I want to give a shout-out to my collaborators who have worked on all or some of the parts of the things I'm going to present; without them this would not have been possible. So, a quick intro to why tracking changes to single tokens in such a system is worth looking into, and also not a super simple task. Say you have something like Wikipedia articles and their revisions; it could also be any other document that has versions. Let's say we have revision zero, which created the yellow text; revision number one, which creates the blue text; and revision number two, which deletes the blue text and adds the violet text.
Revision three deletes the violet text and reintroduces the blue text. And then we have revision number four, which reintroduces the violet text again, deletes the blue text again, and then adds something else inside the text above. What I'm saying here is that we would like to model this in a way where there is not just a deletion and then a new addition; we basically want to remember that this text was added in revision number two, deleted, and then came back, with all the information attached to it still in place. That is basically the problem. Now you could ask why that is complicated. Don't think about the editors here, just think about revisions; it doesn't really matter who did this. Why is this not easily solvable? You might immediately think of a text diff to solve this problem, and there are several problems with standard text diff algorithms. One of them is that they are not super efficient. Another is that some of them, at least the most standard implementations, have problems tracking when something moves from the top to the bottom of a page: they will report a removal and then a new addition of a token or a string, when really it is just a move. But most importantly, diffs between two revisions, one pair of revisions at a time, are not really well suited (or not suited at all, actually) for tracking what is happening, what is being reintroduced. You can mitigate this to some extent by checking when a revision is exactly identical to an earlier one; then you basically just reset, or take over, the information of revision two at revision three here. But there are quite a few instances, as we discovered, where you have reuse like here without an exactly identical revision: you have, for example, a reinsertion plus a new addition, and you would like to track that as well. So that is the problem as it presented itself to us, and we thought about how to solve it. When we first started with this in mid-2013, not a lot had been done in this area. De Alfaro and Shavlovsky had a nice approach that came out then, and Wikimedia's Aaron Halfaker later also published some very nice Python libraries in this direction. But in any case, none of the approaches that were available actually solved the problem exactly as we would have liked, as I modeled and showed here. So we developed our own approach, which we call WikiWho, and it was published in 2014. What it does, in a nutshell (I have to refer you to the paper for time reasons), is take the revision apart and put it into a hierarchy: the revision, then the paragraphs underneath that revision, the sentences inside the paragraphs, and then the tokens inside the sentences. That's the model we arrived at through experimentation. Tokens here include special characters, because we operate on the wiki markup language, not only on the front end.
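Here is a minimal sketch of that hierarchical splitting, together with the hash-based reuse detection described next; this is an illustration of the idea, not the actual WikiWho implementation (which splits on wiki markup and treats special characters as tokens of their own).

```python
import hashlib

def h(text):
    """Stable hash used to recognize reused paragraphs and sentences."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def split_revision(wikitext):
    """Split a revision into paragraphs -> sentences -> tokens (very rough)."""
    tree = []
    for para in wikitext.split("\n\n"):
        sentences = [{"hash": h(s), "tokens": s.split()}
                     for s in para.split(". ")]
        tree.append({"hash": h(para), "sentences": sentences})
    return tree

prev = split_revision("The Celtic knot is a knot.\n\nIt is decorative.")
curr = split_revision("The Celtic knot is a knot.\n\nIt is very decorative.")

# Reuse check: a paragraph whose hash is already known needs no diff at all;
# only changed paragraphs (and inside them, sentences) are diffed down to
# token level, which keeps the expensive diffing to a minimum.
prev_hashes = {p["hash"] for p in prev}
for p in curr:
    status = "unchanged (reused)" if p["hash"] in prev_hashes else "needs diff"
    print(p["hash"][:8], status)
```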
What we store, then, in a hierarchical way, are the objects we work with internally; this is implemented in Python. We can then just reference things: we look for reused parts via hash values in hash tables, and we can just reuse them, running diffs on differing words only where really necessary, because of the problems that I mentioned earlier. And that works surprisingly well, because Wikipedia involves a lot of reverting and reuse in a lot of places. What was also super important for us was accuracy, because we wanted it to be useful in real use-case scenarios even at the single-token level. In another paper we evaluated it for English, and accuracy was around 95 percent, and we have been tweaking it ever since. It is actually the only approach that has been empirically shown, against a gold standard, to be that accurate, which might be interesting if you want to trust and use that data in whatever you're doing with it. Okay, so that is the how. What comes out of it is pretty straightforward, if what I just explained was understandable. Looking at the violet sentence here, and looking at the whole last revision, we have these strings, and these are basically the values of our tokens. Each token is assigned the origin revision where it appeared for the first time, a list of the revisions where it was deleted, and a list of the revisions where it was reinserted, if that was the case. Every time a new token is originally added, we assign it a unique token ID, so we can tell apart tokens with the same string; for example, here we have T6 and T16, both periods, and we can tell them apart by their position in the whole article. You see that we did this here for the last revision; we can of course do it (and did do it) for all the revisions that exist in an article. That is what you get in the datasets that we produce. We call this the TokTrack dataset, which we published last year together with a dataset paper at ICWSM. It runs up until October 2016 and contains what I just showed you for all articles in English; you can get it on Zenodo under a CC BY-SA license. Let me just show you what that looks like if you really want to reuse it. We have batched article files, where you have the current content that was present at the time we created this, October 2016, and then all the content that was deleted up until then, in batched files marked with page ID ranges; you can just download these and get the pages you want, or read all of them. We have also published some material computed on top of it, mainly reverts, conflict scores, and survival of tokens over time, which we also describe in the dataset paper, so you can study that if you want to know how to use it and what it actually represents. Also maybe interesting: we have been running this continuously, and right now we offer an API in five languages that listens to the event stream of Wikipedia and gets, for those five languages, in near real time, the information I just showed you, in JSON format. You can get that from our API, for example for running tools, or you can also query it for subsets of the data; go to api.wikiwho.net and get it there.
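For the curious, a query against the live service might look roughly like the sketch below. The endpoint path, parameters, and response fields here follow the pattern documented at the time, but treat them as assumptions; the Swagger docs mentioned next are the authoritative reference.

```python
import requests

# Sketch of a live query against the WikiWho API. The exact endpoint path,
# parameters, and JSON field names are assumptions here; check the Swagger
# documentation at api.wikiwho.net for the authoritative interface.
url = "https://api.wikiwho.net/en/api/v1.0.0-beta/all_content/Celtic_knot/"
params = {"o_rev_id": "true", "editor": "true",
          "token_id": "true", "out": "true", "in": "true"}
resp = requests.get(url, params=params)
resp.raise_for_status()

for token in resp.json()["all_tokens"][:5]:
    # Each token carries its string, a unique token ID, the revision where
    # it originated, and the lists of revisions that removed/reinserted it.
    print(token["str"], token["token_id"], token["o_rev_id"],
          token["out"], token["in"])
```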
There's also documentation there, and a nice little Swagger interface where you can try out the different settings and see what you can actually get in terms of data. Maybe some background to this: GESIS provides research, but also, to no small extent, services for the social science community, which includes the computational social science community. So the interest of GESIS here is, first of all, Wikipedia researchers, but on the other hand also social scientists interested in social phenomena. We have a current internal project where we look at the German Wikipedia and political terms that are disputed, that do not survive, or that get deleted in the German Wikipedia, cross-referencing this with Twitter data (the same terms there, and their salience there) and some Facebook data, as far as we can still get it. That is basically why GESIS is also interested in this and in keeping up the service. And we're also looking into extending this to other wikis, beyond the Wikimedia projects, for example. Okay, so that's the data: how it was created, what it is, how you can get it. Now I want to talk quickly about some of the things you can do with it. For example, you can look at the types of edits. We have all the actions, the additions, deletions, and reinsertions, in Wikipedia over the years. You can see that this closely follows the edit curve, if you have seen the edit curve before, which is no surprise of course, more edits, more actions, though not one to one. And you can actually see how the additions, the reinsertions, and the deletions behave in English: in 2006-2007 we actually see more deletions coming up, and consequently also more reinsertions, because reinsertions follow deletions, right? In German, for comparison, it looks very similar, both for the actions overall and for the additions, reinsertions, and deletions, so the general pattern seems to be very similar. We can also look at the survival of these actions over time. I'm adopting a measure here that Aaron Halfaker proposed, I think, one or two years ago in a showcase: we basically look at how long they survive, and we set 48 hours as the threshold, because afterwards not much happens anymore in terms of deletion; most deletions actually happen much earlier, in the first 12 or 24 hours. So if it survives 48 hours, we say: okay, that has survived. We can also take the ratio of those two curves, and we see that there was a dip here in 2007, again for English, and then it recovered and stabilized over time. When we compare German and English in this respect, we see that German had the same decrease, but then also stabilized, just at a higher level. We can also look only at deletions; so far we looked at all the actions. Looking only at the deletions, we see an interesting spike of deletions, and we also see that these were apparently quite successful. If we look at the ratio, in comparison to all actions it is of course lower: deletions are not as successful, they cannot survive as long. That is not a surprise, because a deletion usually touches someone else's content.
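The 48-hour survival rule just described is simple enough to sketch directly; the timestamps here are made up.

```python
from datetime import datetime, timedelta

SURVIVAL_THRESHOLD = timedelta(hours=48)

def survived(added_at, first_deleted_at):
    """48-hour survival rule: an action counts as surviving if it was not
    undone within 48 hours. Most deletions happen in the first 12-24 hours,
    so later ones barely move the measure."""
    if first_deleted_at is None:  # never deleted
        return True
    return first_deleted_at - added_at >= SURVIVAL_THRESHOLD

added = datetime(2007, 3, 1, 12, 0)
print(survived(added, datetime(2007, 3, 1, 18, 0)))  # False: gone after 6h
print(survived(added, datetime(2007, 3, 5, 12, 0)))  # True: lasted 4 days
print(survived(added, None))                          # True: still there
```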
Coming back to the deletions plot: we also see this interesting spike, which actually survived very well in terms of deletion, and which, to the best of my knowledge, is the deletion of the interlanguage links from Wikipedia when Wikidata came up and we did not need those anymore; mostly probably done by bots. And German shows very similar patterns for deletions. Another thing we can look at is reverts. So far, most research has used identity reverts, as they are sometimes called, where you have two identical revisions and everything in between is considered to be reverted. You can still do that, but now you can also quantify it. For example, here I quantified it: revision two deleted five tokens, and revision three undoes that deletion, and it also deletes again the tokens revision two had added. So I can actually say: there is a quantity of ten tokens, or ten actions, being undone here towards revision number two. I can also do it when there is no identical revision: here revision number four deletes one token, and we get a partial revert of one in ten of those actions. And when we look at the distribution (this is from the TokTrack paper), we see that, yes, a lot of the reverts are actually full reverts, and a huge part are very small ratios of things being undone. If you look at the absolute numbers, it is even more power-law distributed on the left side: many small changes, of course. But the interesting figure here is that roughly three to four is the ratio of partial to full reverts. If you have any questions about the weird spike in the middle, I can answer that later. And we also compared: when we try to find only the full reverts with our method, what is the difference from the older, more naive version, which you can still use, where you have two identical revisions and everything in between is considered fully reverted? It turns out that it is actually not that bad, as other research has also assumed through other means; it is pretty good. But on the other hand, you do make a considerable number of mistakes when you say everything between two identical revisions is fully reverted, when actually you are getting partial reverts, or sometimes even no reverts at all. That can happen, for example, when people try to fix vandalism, do not succeed completely, then incrementally try to fix things, and in the end the page goes back to a revision that was there before. And some fun facts: 61 percent of all edits include some kind of removal or reinsertion, not just adding, and around 15 percent of revisions are actually self-corrections, in one way or another. Another use case I like a lot is n-gram survival and popularity. You probably all know the Google Ngram Viewer: it tells you, for a specific time slice or time point, what existed in terms of n-grams in the Google corpus. Here we are doing it with unigrams only.
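Coming back to the revert quantification for a second, here is a minimal sketch of the ratio idea, with hypothetical token IDs; how far back one tracks, and which action types count, are the judgment calls discussed later in the Q&A.

```python
def partial_revert_ratio(actions_done, actions_undone):
    """Quantify how much of a revision's actions a later revision undoes.

    actions_done:   token IDs added/deleted by revision R
    actions_undone: the subset of those actions a later revision undoes
    Ratio 1.0 is a full revert; anything in (0, 1) is a partial revert.
    """
    if not actions_done:
        return 0.0
    return len(actions_undone & actions_done) / len(actions_done)

# Hypothetical: revision 2 performed ten actions (by token ID);
# revision 3 undoes all of them, revision 4 undoes only one.
r2_actions = {f"t{i}" for i in range(10)}
print(partial_revert_ratio(r2_actions, r2_actions))  # 1.0, full revert
print(partial_revert_ratio(r2_actions, {"t3"}))      # 0.1, partial revert
```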
Returning to the n-grams: in Wikipedia we can not only do what was possible before; with this data we can actually look at what is added at a specific time point and what survives from it, which is a different thing. So I call these "vanilla" unigrams, because there is nothing special about them: color words like "green" are not ambiguous, and they do not usually invoke disputes, because they do not have many meanings. What you see here on top is the ratio of how they were added over time, relative to all other strings: I aggregated all the tokens that have the string "green", in relation to all other tokens. Pretty stable. And if I look at the survival, normalized against the other tokens, where one is basically the average, I see this is pretty average in terms of survival, and it does not change. On the other hand, we have things that became popular over time, like the ref tag from the markup: it was added more and more over time, but with a stable survival rate, always surviving pretty well, even a little above average. Then there are things that you should not use, for example words from the Manual of Style's "words to watch". If you look at those, we see very different patterns. We see, for example, an interesting one that was used less over time, so people tried less often to add it, but at the same time, when they did try, the survival rate also went down. So it became more and more unpopular to use these things, puffery words and so on. Same thing for "apparently", same thing for "famous"; almost all of these words to watch show this kind of pattern. But we also see this pattern for some other words, like "later". So together with some linguists who are very interested in this data, we are looking into this, also for bigrams, trigrams, and longer n-grams, to see what patterns we can actually find here. And for other words, for example "conservative" and "liberal", where I personally expected very low survival rates compared to others, they are actually pretty stable as well. Okay, so that was survival of tokens. Another thing you can do is look at how often tokens actually go in and out. We have a pretty straightforward heuristic, which we describe in the paper as well: we look at how often a token gets deleted, reinserted, deleted, and so on. And if the time in between is shorter, we apply an exponential weighting, where the quick back-and-forths are weighted much higher than the ones with a lot of pause in between, because we do not consider those a heated conflict. With this measure we can calculate, for each token, and then in aggregation for each string, how conflicted it is. We did that for all the tokens in the English Wikipedia in that dataset. And this is an interesting example: "Dumbledore" and "Voldemort" are obviously super conflicted in the content that still existed at the end of October 2016. On a more serious note, this helps you look horizontally across articles: even if an article itself is not really conflicted overall, it helps you find conflicted things.
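A toy version of that conflict heuristic might look like this; the exact decay function and the tau constant are assumptions for illustration, not the precise weighting from the paper.

```python
import math
from datetime import datetime

def conflict_score(action_timestamps, tau_hours=24.0):
    """Toy conflict heuristic: each delete/reinsert flip of a token adds a
    weight that decays exponentially with the time gap, so rapid
    back-and-forth counts as heated conflict and slow churn barely counts.
    tau_hours and the exact decay are illustrative assumptions.
    """
    score = 0.0
    for earlier, later in zip(action_timestamps, action_timestamps[1:]):
        gap_hours = (later - earlier).total_seconds() / 3600.0
        score += math.exp(-gap_hours / tau_hours)
    return score

fast = [datetime(2007, 1, 1, h) for h in (0, 1, 2, 3)]  # hourly flips
slow = [datetime(2007, m, 1) for m in (1, 3, 5, 7)]     # months apart
print(conflict_score(fast))  # close to 3: a heated conflict
print(conflict_score(slow))  # close to 0: not really a conflict
```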
And when we talk about bigger n-grams, that becomes even more relevant: finding things that are conflicted across different articles over time, and then locating them, even if the articles themselves are not conflicted. And of course you can also do this over time. In the GESIS project that I mentioned before, we are trying to do this with political terms in Germany, seeing how they are distributed, how their conflict scores change over time, if they do, and where that happens. So that is another, more serious application of these scores. Okay, I'm going to skip this one because of time, but you can basically also look at productivity, for example the success rate of IP edits, and what you see there is that the IPs are actually pretty productive and do not get into fights; those are our preliminary results for this, which is also something you could do. And of course you can do the same things for different users. Sorry for skipping so fast over this one, but I want to have some time to talk about the tools. So one class of tools we have is markup tools. WhoColor has been around for two or three years now, feeding from the live API. What it can do is provenance highlighting for authorship; we can also mark up different words or sequences and visually inspect their deletions and reinsertions over time, and how long that took. We have the conflict score that I just mentioned, and can see that in a visual way, with highlighting, and age, for example. This is live; it is currently a Greasemonkey extension for Firefox or Chrome, and you can basically go to any Wikipedia site in the languages that I mentioned and get this. The WikiEdu dashboard also uses this to highlight the impact that the learning editors have on the articles, to give them some insights and motivate them; I think XTools uses it too right now, I'm not sure. Basically, all those tools use our WhoColor API, which is an augmented version, with more information, of the default API; you get all that markup readily provided, and you can just build an interface on top if you like. And lastly, there is one tool I personally like a lot. It is still in development, because the performance is not completely there yet, so it is not released, but I still want to show it, because it is a really good use case of what you can do with that data. It is a student project, and I made a screencast, because we all know what happens when you try to do live demonstrations, especially on the web. So what this is is a set of Jupyter notebooks that are Binder-ready, so you can use them directly in a MyBinder instance, or any Binder instance; for example, we have one at GESIS. It queries live data and runs in app mode: by default you do not see the code, you can just use the interactive features of Jupyter, but you can also get access to the code if you want to. I anonymized the user that I queried here, but it is a live query on the WikiWho APIs, and it opens a notebook here after passing a parameter.
What it does is query our APIs and get some general statistics, like pages created and talk pages created. When we go further down, we have the total adds, deletes, and reinsertions plotted, with and without stop words, and their survival rates, in total, for all the articles this editor ever edited; we have them per article as well. These are all pandas data frames: you can just toggle the code, get straight to the data frame, and use it for whatever you want to do. There are some visualizations of this in Bokeh; you can also play with the data of that editor over time and see their insertions, deletions, and so on. Here is the same thing for deletions, reinsertions, and additions, with the survival rates. You can, for example, see that in January or February or March 2013 they had a not very successful streak of deleting stuff; it did not work. Then we have the most-owned words and articles: for words "owned", you can see where they own the most words, in the sense that they originally wrote them; and some time-related things, like time spent editing. And then we have the conflict score again: we add up all the words that had some kind of conflict this user was involved in, order by the conflict score, choose an article, and open the next notebook. There we only see the information for that editor and that article, and we can investigate further. So we see again the adds, deletes, and reinsertions for that article, and can zoom into this. We see the words owned, in relation to the other words and in absolute terms; of course, as the article grows, your relative share will go down while your total could stay the same. Then we have visual overviews of all tokens added; we have the data frame displayed; and the stop words in the tag cloud, for example, you can take out, just to show what you can do with the data. I know tag clouds are so 2000, but they are still pretty helpful in this instance. We see, for example, that "containing" and "aluminium" were terms that this editor deleted a lot in this article; of course, we also see this in the table. Finally, we have some selection menus where you can, for example, look at the surviving tokens without stop words; you can just see the head of the data frame, and so on and so on. Okay, then we have the same thing for deletions again, some general user metrics again, and we have the conflict overview again, this time only for the article. We see that the conflict was apparently introduced around 2007, and we can look at what was actually conflicted: not surprisingly, it was all about tokens that have the string "aluminium" or "aluminum", kind of what we saw already in the deletions. What the user was obviously doing here was having a kind of revert fight, a war, about that word. And since the revisions are linked here, you can go to the diff and actually inspect it, and that will show you that this is actually the case. Okay, so that was basically the data and some tools. We hope to release these notebooks pretty soon, once we get around the performance issues that we are still having, so that everybody can use them.
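To give a flavor of what those per-editor data frames might look like once loaded, here is a tiny pandas stand-in with invented numbers.

```python
import pandas as pd

# Invented numbers, shaped like the per-month editor statistics the
# notebooks expose: adds, deletes, reinsertions and 48h survival rates.
df = pd.DataFrame(
    {
        "adds": [120, 95, 210],
        "dels": [30, 250, 40],
        "reins": [5, 12, 8],
        "adds_surv_48h": [0.91, 0.88, 0.93],
        "dels_surv_48h": [0.70, 0.22, 0.75],
    },
    index=pd.period_range("2013-01", periods=3, freq="M"),
)
# A deletion streak like the one described above stands out immediately:
# many deletions in one month, few of which survived 48 hours.
print(df)
print(df["dels_surv_48h"].idxmin())  # 2013-02
```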
Yeah, and that was basically it from my side, so I'm happy to take your questions.

All right. Thank you, Fabian, that's awesome. Dario, would you like to relay the questions from IRC? I think, following the order they were mentioned, we have questions from Aaron Halfaker, Tilman, yourself, and myself. Although I didn't quite track the question from Aaron, so I think he was posting it somewhere else. I know what it was; I actually had a similar question. This goes back to your analysis of partial reverts: what constitutes a partial revert?

So, of course, one can argue about this. Let me go back to the slide. Sorry. Okay, here we go. A partial revert is basically when you do not undo everything that has been done in a revision. By definition, you undo everything that has been done when you have, for example, a revision, then some other revision, and then another revision identical to the first one; then you always undo everything in the second one. A partial revert is when you undo something that the second one has done without an identical revision being created. Basically, if you have, for example, an addition or a deletion here, and the third revision undoes part of it, then the revert is partial. You can argue about how far back you have to go in tracking this, or about, for example, whether to look at the undoing of a deletion or the undoing of an addition, and what constitutes actually undoing something. But in the end: not everything is undone. I don't know if that answers it or if I should go into more detail.

I think that does; so just to make sure that I understand: a partial revert is not simply an edit that removes some tokens but not all tokens from a previous edit? It always refers to the actions that the revision has done. So a revision's action could be simply deleting something; it does not have to add anything. If I delete two words, and the next edit puts back one of those words, then it has undone 50 percent of what I have done, 50 percent of my deletions. Okay, so a partial revert can cover a wide variety of different edit types and behaviors. So it is not as concrete as a full revert, and not at all what we might assume about the intention behind it. No, exactly, it is not as easy to interpret. In the paper we actually wrote that it is really hard to say this is antagonistic; most of the time it is really not antagonistic, it is really corrections, or even building on top of someone else's work. So "revert" might actually be the wrong word here, because it carries a connotation of antagonism, which is not what is happening in most of these cases. That makes sense. Okay, cool. Thank you for the clarification.

Tilman or Dario? Tilman is in the room, so I think he can ask his question directly. But the room is muted. Brandon, can you unmute the room in San Francisco? Is the question just whether the slides are available online? That's the question; I will mute. Yeah, first of all, it would be great if the slides become available. And separately: this obviously has a lot of potential applications, and you already mentioned some things that are already being used.
I was wondering which applications you would most like to see, which you think would be most useful to actually put to use for Wikipedia editors or readers. Sorry, the question was what I think would be most useful? Yeah, what you would most like to see actually made available to readers and editors at a larger scale.

I think what is really useful for most people is really just the history of single parts of an article, judging from the feedback that I got on the things that we do. And we are not doing a super good job of this with the WhoColor tool yet, just because the visualization is still preliminary; but really, just finding changes to specific parts of articles, I think, is really important for people, because they usually have to click through the whole revision history to do this. There are some tools that do this, but they are not super accurate. And then, basically, track down the revisions where a change happened the first, second, third time. Conflict, for example, and age, might be interesting, but conflict is a little bit like the WikiTrust tool we had a couple of years back: even if it is super accurate, it might be super hard to interpret for most people. So I think the more straightforward things are more useful for most people. And yes, the slides will be online.

Awesome. My question was about that intriguing demo of the series of Jupyter notebooks that you displayed. I wanted to know about the intent behind that. The intent is to make it possible for people to basically generate a set of notebooks based on some input data, kind of dynamically, right? So you have this analysis template that you can feed new data into, and then all of these different methods of visualizing and analyzing your data are just available to you; like creating a dynamically generated interactive report, is what it kind of looked like to me, but that might be too optimistic.

Yeah. So (I'm trying to put my camera back on, but it doesn't work, sorry) our intention was basically this: at GESIS we also run seminars, we train social scientists and people who want to get into data science and explore data like Wikipedia's. And the idea here was: on the one side there are closed web apps, which people really like to use; on the other side there are Jupyter notebooks, which are still code, so people are a little bit afraid of using them. So we wanted some kind of hybrid, where you can get live data and analyze it, for example at the people level, because that is what social scientists are used to, and then, if you want, you can dig in deeper. You can, for example, toggle the app mode and just go down to the data frames and play around with those. That was basically the idea. We also plan to use something like this in teaching. So that is the idea behind it. Promising. Yeah. I mean, of course, one of the biggest hurdles is performance, and we have been looking into caching, what we can cache and what can actually be done in real time. That is why we have not released this yet. And also, we have some problems with data security, making sure we do not expose things in the back end. Well, yeah.
Very promising.

Yeah. Of course, one of the biggest hurdles is performance, and we have to look into caching — what we can cache and what can actually be done in real time. That's why we have not released this yet. We also have some issues to solve around data security, so that we don't expose things in the backend. Once that is solved, it should be publicly available.

Okay. Can I go next? All right. Fabian, excellent presentation — I'm blown away, as usual, by this work, and I'm super happy we gave you a slot to present. Hopefully more people will come to rely on the dataset. I have a few comments and a question.

The first comment: we've obviously seen many different versions of the editor decline and recovery plots — we were having some back-channel discussion with Aaron on IRC about how he produced some of the token-based plots for productive edits — but this is the first time I've seen these other series of plots. So I just want to flag, because our community and Foundation staff may not be aware of this, that these are super interesting: they tell an extremely original and rich story about the dynamics of our projects.

The second, quick question: there's the static dataset, and then you said that the API basically generates a response in virtually near real time. So are you ingesting the live revisions from a recent-changes feed, or is this based on dump processing?

No, this is event stream processing that we're doing.

Gotcha. So virtually any change that happens is picked up in real time?

Yes. Though, I mean, we're a research institute with limited resources, so when I say near real time, of course there's a lag — with uptime we're pretty good, but not that good. But it works pretty well.

Which means everything is captured, even before deletion actions and so on — which is really interesting. The final question is about the scope of this. My understanding is that the first static release was limited to the main namespace. Is the API extending to other namespaces, and if not, do you have any plans to do this? And I'm going to give you one reason why I think this would be super interesting.

Do you want to give me the reason first, or —

No, first I want to hear the answer.

Currently not, just because of limited resources — it's actually not that much work to do it. We did, of course, think about talk pages, but with talk pages it didn't seem to make that much sense, because there are not that many deletions and reinsertions for talk page entries —

Well, I would say — you can correct —

Yeah — I'll let you finish. That was my reasoning behind it: you can just use the signatures and see what was written there. But you might correct me. For other namespaces, why not? It's totally technically possible; we just have to focus our attention where it matters most. But it is possible, yes.
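Fabian's near-real-time answer above refers to event stream processing rather than dump parsing. Wikimedia does expose a public server-sent-events feed of recent changes; a minimal consumer could look like this, though the speakers' actual ingestion pipeline is not described in the talk:

```python
# Minimal consumer of Wikimedia's public recent-changes event stream.
# The endpoint is real; how the presenters' service ingests events is
# not specified in the talk, so treat this as an independent sketch.
import json
from sseclient import SSEClient as EventSource  # pip install sseclient

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

for event in EventSource(STREAM):
    if event.event != "message" or not event.data:
        continue
    change = json.loads(event.data)
    # Mirror the dataset's current scope: main-namespace edits only.
    if change.get("wiki") == "enwiki" and change.get("namespace") == 0 \
            and change.get("type") == "edit":
        print(change["title"], change["revision"]["new"])
```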
Okay — the reason is that we have this paper, written in collaboration with researchers at Cornell and Jigsaw, that looks at the reconstruction of discussions on talk pages. We do this not just by taking a snapshot of the markup of talk pages, but by also looking at the changes between revision and revision. And one of the findings of this work is that a sizable fraction of content — you would assume that threads on talk pages grow incrementally, with very little removal or change, especially in discussions. It turns out there's a lot of content that gets refactored or changed. And specifically in the context of toxic language, there's a lot of moderation happening that is invisible if you only take the final snapshot of the discussion: you would basically not realize there had been some kind of negative interaction during the conversation, but you start seeing it once you reconstruct the entire token change history in that namespace. So I'm going to point my collaborators to your dataset, because I think it would be fascinating to use it to study the same phenomenon you were looking at in the context of main article writing, but in the context of discussions — to see, for example, the prevalence and time to removal of specific tokens that may be related to personal attacks or harassment. I think there's an entire opportunity in that space, and I really like the idea of extending this to other namespaces.

Yeah, that is a very interesting idea indeed. I was not aware that there were that many changes on talk pages.

I'll share a pointer to the paper so you can see it.

Thanks.
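To make the proposed measurement concrete: given token-level records with insertion and removal times, the "time to removal" of flagged tokens is a simple difference. The record format and flagged-word list below are invented for illustration only:

```python
# Sketch of the proposed "time to removal" measurement. The record
# format and the flagged-word list are invented for illustration.
from datetime import datetime, timezone

def ts(s):
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

tokens = [  # hypothetical token-level records from a talk-page history
    {"str": "nonsense", "inserted": "2018-07-01T10:00:00Z", "removed": "2018-07-01T10:04:00Z"},
    {"str": "thanks",   "inserted": "2018-07-01T10:00:00Z", "removed": None},
]
FLAGGED = {"nonsense"}  # stand-in for an attack/harassment lexicon or classifier

for tok in tokens:
    if tok["str"] in FLAGGED and tok["removed"]:
        minutes = (ts(tok["removed"]) - ts(tok["inserted"])).total_seconds() / 60
        print(f"{tok['str']!r} removed after {minutes:.0f} min")
```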
Awesome. So those were my questions and comments, and I don't think there's anything else from IRC about either of the two presentations — I'll give it a few more seconds to see if I'm missing anything. Anything else from the room, or from our other speakers?

Yeah, maybe just one more: do you have any other publications that present the pipeline, or other ways you want to look at this data?

Yeah — two or three years ago we also had an application, a visualization of the interactions between editors. Because once you have these partial reverts and these conflicts, you can of course also translate them into edges. That was basically a demo; I think we never got around to implementing it with live data or running it on a bigger dataset, because it's a little tricky to interpret. As we just heard in the question about partial reverts: with full reverts it's always fairly clear that something antagonistic is going on, but with partial reverts it becomes much more complicated to translate that into a specific edge. It could be very interesting, though. You have to be careful, because you're basically making a judgment about an interaction when you show this to editors. But that kind of network generation is something I find really interesting.

I had one more quick question, probably for Hadi, Lucy, and Pavlos, about your user evaluation, where you were looking at the grammatical correctness and the appropriateness for the wiki of your automatically generated lead sentences, compared with a set from Wikipedia and news. It wasn't clear to me what was in that comparison set. When you say news, what were the actual strings you were having people evaluate? And when you say Wikipedia, what were the actual strings?

Since we aim to generate a summary, we work with first sentences — the summary of a topic is basically the first sentence of its Wikipedia article. So in both cases we would use the first sentence of the news article, or the first sentence of the Wikipedia article, versus one of our generated sentences.

Okay, that makes perfect sense. Thank you. All right, I think that is it. Thank you all very much. Thank you to everybody who tuned in to this edition of the Research Showcase, thank you to the presenters, and with that, have a wonderful day, everybody. Thank you.