Can I just get a quick survey of the interests people have here? We don't have a fixed outline, so I'm happy to go into whatever is useful. Can the researchers in the house raise your hands? Alright. Can the tool authors in the house please raise your hands? Cool. What are you all doing here, who didn't raise your hands? Intellectually curious? Step forward.

Alright, so to quickly give an overview of what ORES is: we use some basic machine classifiers, which are things that categorize other things; sorry, algorithms that categorize things. They're based on training data that comes from, generally, human-entered labeling, which is when you look at edits by eye and say "that's the F-bomb, not appropriate, this is damaging." We have a special interface called Wiki Labels that gathers that data. We can only create models for things we have data for, so if you have new ideas for models for us to implement, let us know now, because it's a long process to design a model, gather the data, figure out how to analyze it, and build it. But we're interested in any new applications you think there might be.

ORES is a generic container. Right now it's designed for scoring revisions only, but in the future we plan to look at other wiki entities. That means revisions, as in a snapshot of an article, and diffs, as in the changes between versions. The difference: a diff is something like the edit that was just made to an article, while the revision is the whole article at a point in time, so for a revision you would look at something like the quality of the article.

The current models available include article quality, which gives you something like the Wikipedia 1.0 assessment of an article: is this a Stub, is this a C-class or B-class article, is this a Featured Article? We find those line up pretty well with the human assessments of articles. Another type of prediction we can make (yeah, I should probably paste links in here) is the likelihood that an edit is damaging to an article, and the likelihood that an edit was made in good faith or bad faith. Those usually line up: if it's damaging, it's probably bad faith, and if it's not damaging, it's probably good faith. But there's some interesting stuff happening in the two combinations that don't normally occur. If something is damaging but made in good faith, that's a good opportunity to mentor an editor. If something is not damaging but made in bad faith, our model is probably just wrong, or it's the most benign attempt at vandalism ever.

We're just offering a service here, but in my opinion the purpose is for people to take these scores and actually create social impact on Wikipedia with them. One thing that's already done: the damaging predictions can help reduce the edit patrolling backlog, so patrollers don't have to look at every single edit; instead they look at just the ones that are most likely vandalism. I can show you what that looks like on Wikipedia, hopefully. In Recent Changes there are new filters; hopefully you've seen these. The filters include some categories based on ORES predictions. Can you see them? Yeah, so this is "very likely good"; we'll make that green. "Maybe problematic", "probably bad", "definitely bad". This is really big for accessibility.
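As a minimal sketch (not from the talk) of how a tool might act on the two scores together, here is a hypothetical triage function; the 0.5 cutoffs are placeholders, and a real tool should pick thresholds the way the thresholds discussion later in this session describes.

```python
def triage(damaging_prob, goodfaith_prob, cutoff=0.5):
    """Bucket an edit using ORES 'damaging' and 'goodfaith' probabilities.

    The 0.5 cutoffs are illustrative placeholders, not recommended values.
    """
    damaging = damaging_prob >= cutoff
    goodfaith = goodfaith_prob >= cutoff
    if damaging and goodfaith:
        return "mentoring opportunity"  # well-meaning but harmful edit
    if damaging and not goodfaith:
        return "likely vandalism"       # route to patrollers
    if not damaging and not goodfaith:
        return "model probably wrong"   # or very benign vandalism
    return "probably fine"

print(triage(0.92, 0.88))  # -> mentoring opportunity
```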
If red and green look the same to you, being able to choose different colors is great, and someone like me who is fine with red and green can still keep that choice. James here is saying that being able to choose the colors is great for accessibility. Absolutely, and it's fun. So you can see that the good ones look green. The ones we're not sure about... actually, these are probably ones that haven't been evaluated by the service yet; those don't have anything on them. "Russian jokes", that's a prime target for vandalism, let's see. Actually, that one might be okay. Anyway, these are just some of the things you can do with ORES.

It would be great to get a little feedback from the audience about what you're interested in, though. This was just a quick overview of what's available. I should also mention a fun new model that we're on the verge of deploying; it's available in beta but not in production. It's an association between brand new articles, which might have just a few sentences in them, and the WikiProject whose community might be interested in that article. The idea is that hopefully somebody writes a tool connecting the new article patrolling workflow with this information, so that we can say: hey, WikiProject Medicine, you might want to know that someone just created an article, but it currently doesn't have any citations, and there's a good chance it's going to be deleted. Please go find the editor and rescue them from what's about to happen, or help them with what just happened, their article being deleted while they're a new editor. Hopefully that will reduce the new article patrolling backlog.

Can I ask: are people here interested in how to actually query the service? Okay, for some reason I was expecting more than that, but at least two people are interested, so I'll go ahead and show you. (Got a Spanish keyboard mapping here.) If you go to the top level of our server, you'll see documentation on the types of queries you can make. All of our requests so far, I think, are GET requests that return JSON, so it's really easy to use from almost any environment, even from a plain JavaScript environment; in fact, the first iteration of this was done in JavaScript as a gadget.

Here's the top-level documentation. Once you dig into it, you'll see you can get scores for a number of revisions, or a score for one specific revision. You can use this documentation to run the API interactively with all of your parameters available. So we'll say enwiki, and a revision that we know exists. I won't give any other parameters, because they're optional and mostly restrict the output, so we'll look at the most output we can get. It says: here are the models we have for this. There's "damaging", which I was explaining. There's "draft quality", which is another quick way to help patrol new articles; this one is probably okay, with a small chance that it was a spam edit. There's whether or not the edit was made in good faith. And then there's the article quality I was describing; quality can be tricky to decipher, but in this case we think the article is still Stub quality. I could show you the actual diff, but nobody needs to see that.

Let's see. If people don't want to know how to query the service, can I ask what you do want to know? Anybody want to jump in here, please? How do we do the scoring for things like article quality?
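Here is a minimal sketch of the kind of GET request described above, using the public v3 endpoint; the revision ID is a placeholder, and the response shape follows the documented wiki / scores / revision / model nesting.

```python
import requests

rev_id = 123456789  # placeholder; any existing enwiki revision ID works
url = f"https://ores.wikimedia.org/v3/scores/enwiki/{rev_id}"

response = requests.get(url, timeout=30)
response.raise_for_status()

# With no extra parameters, ORES returns every model it has for the wiki.
for model, result in response.json()["enwiki"]["scores"][str(rev_id)].items():
    score = result.get("score", {})
    print(model, "->", score.get("prediction"))
```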
Yeah, okay. Article quality, I think, is done by taking articles that have already been assessed. Aaron, can you correct me on this? Is article quality done using a labeling campaign or by using existing judgments?

It depends on the wiki. For English Wikipedia there are a lot of labels that people have applied to articles, saying this article is a Stub, this article is C-class, this article is a Featured Article. But on a lot of other wikis those labels aren't commonly applied. For example, I don't believe Turkish Wikipedia does any real assessments at all, so we set up a Wiki Labels campaign; actually, I think it's still active, and if you want to look at Turkish Wikipedia we can pull up an example. Essentially, we show people a random sample of articles and have them say this article is this class, that article is that class, and then we train the model on those labels. In the majority of cases, though, there are some existing labels that are useful. A lot of wikis will label Featured Articles but skip everything else, so we'll set up a labeling campaign to fill in the rest, and we'll use some heuristics, like the length of the article, to make sure we're getting a good sample of articles at various quality levels. Ultimately, though, it's human judgment, whether it comes from our own labeling system, where people manually work through a random sample, or from people working directly on the wiki and labeling articles through their own processes.

And then, to compare the previously labeled articles against new content and determine a score: how does that work?

The breakdown is basically that we try to interpret an article based on things that look like things we've seen before. We have a number of features, as they're called; they're ways we analyze the article or the diff. They might be things like how often curse words appear, to pick a simple example, which gives you a number, like zero percent curse words or a hundred percent curse words. Another is whether the edit was made by an anonymous editor, which turns out to be a little bit problematic.
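To make "feature" concrete, here is a hypothetical curse-word-frequency feature; the real service extracts its features with the revscoring library, not with this exact function, and the word list is obviously illustrative.

```python
import re

# Tiny illustrative word list; real models use much richer signals.
BAD_WORDS = {"damn", "crap"}

def curse_word_ratio(text: str) -> float:
    """Fraction of words in the text that appear in BAD_WORDS."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in BAD_WORDS) / len(words)

print(curse_word_ratio("Damn, this article is great."))  # 0.2
```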
We take all the features we think are useful about an article and feed them into a thing that is good at recognizing patterns it has seen before. In this case we're using gradient boosting, which is just a type of machine learning algorithm, via a library you can play with at home too: the scikit-learn Python library. We take a large number of observations of the type of thing we're going to judge, so we'll take a bunch of revisions and feed them into a feature extractor, which tells us which of the things we're looking for show up in the article. Then the gradient boosting model, in a somewhat opaque way, determines that this is pretty close to something it has seen before, and spits out: yeah, humans say that this kind of edit is damaging a lot of the time.

There are some problems with that approach: whatever flaws the humans who made the judgments have are replicated by the AI. I should note, too, that at least on English Wikipedia a lot of the assessment classes have changed over time. An article that was labeled B-class in 2006 probably doesn't have any citations, and these days you can't even get to Start class without at least having citations in the article. So one of the problems we have is that judgment changes over time, and we do a little bit of work to figure out when judgments most recently stabilized in order to make these predictions. It's a good example of how whatever quote-unquote mistakes there are in the data end up showing up in the model later.

I should probably talk about things you can do with this data. The mechanics of how to query it are simple enough, and if no one's jumping up and down to see me run queries, maybe I won't. But the things you can do with this are pretty cool. We just had Sage Ross, who's writing the Wiki Education dashboard, use this sort of obscure feature inside ORES: the ability to hypothetically re-run a score as if something in the article had been changed. In Aaron's example, you need a lot of citations for something to be considered a good article on English Wikipedia. So you can look at the article quality score, say it's still a Stub-class article, and try changing each of the features to see which one might get you to a Start-class article. You can say "what if I had 10 citations?", add one parameter to your query, and get back what the score would be, so you can see how the score changes with those variations. You could also say "imagine I added 10 images" and see how that changes the prediction.

Let's see, I'm not sure I can... I can, again, if you want. Yeah, please, or you can explain it. So here, I'll see if I can get it started. We have some revision on English Wikipedia, we're going to see what its quality is, and we're going to output the features. These are most of the raw features we're using to actually make our quality judgment. You can see things here like how many words there are (420), how many paragraphs exist without references and what their total length is, and how many characters are in this diff. Then you can take a feature and feed it back in, and say... so now you can help, Aaron.
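Here is a minimal sketch of that training setup with scikit-learn, assuming feature vectors have already been extracted and paired with human labels; the real service builds on this with the revscoring library, and the feature values below are made up.

```python
from sklearn.ensemble import GradientBoostingClassifier

# X: one row of extracted feature values per revision, e.g.
#    [curse_word_ratio, is_anonymous_editor, characters_added]
# y: human labels gathered via Wiki Labels (True = damaging).
X = [[0.00, 0, 350], [0.25, 1, -40], [0.00, 1, 12], [0.10, 1, -500]]
y = [False, True, False, True]

model = GradientBoostingClassifier().fit(X, y)

# predict_proba yields the kind of probability the service reports.
print(model.predict_proba([[0.30, 1, -200]]))
```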
You can inject features. Don't use the gadget, just run the query; you can run this right against the service. Okay, so here's the actual GET request you would make to find which features are being used and what they look like. I won't zoom in on that; it might be a little hard to see from back here, thanks. It's a little more complicated than what you're seeing on the screen, but really the only complication is that we divide one feature by another, and we're not showing that here. For example, one of our features is how many content characters there are per character. Markup is generally not rendered as content, but a word generally is, and that ratio is one of the indicators of quality we use.

Looking at this list, it's kind of amazing that we're able to predict quality at all, but this model is actually very, very good. That's one of the things you run into with machine learning: as long as you have something that carries some signal, usually any modeling approach will be able to pick up on it and make some useful predictions. Fun story: this article quality model is one of our most used models, and people find it very useful despite the really simple feature set of number of citation templates, number of category links, number of image links; it's really just a count of content bits. I think we're at about 200 external queries a minute, and that doesn't count the caches that sit in front of us, and there are a lot of those. That number is somebody's tool querying us right now, whereas the scores that Adam was showing, which load up in MediaWiki, don't get included because they're cached externally in MediaWiki.

So what I just did here: I took this query, which was expanding how we predict the quality of this revision, and said, okay, right now it's almost certainly a Stub-class article; what would happen if (you can see it in the URL here) I added reference tags? At first I tried five, and it didn't make much difference to the prediction. So I said, what about 10 reference tags? Now it's a Start-class article. In the Wiki Education dashboard you can do the same thing in an automated way. You can suggest to a person writing an article: hey, nice work on your article, but it's still considered Stub class; if you want it to be considered Start class, you should probably add at least 10 references. Just an example of something to do with the service.

It's a little hard to give examples of things that people aren't doing yet; I'm trying to think how we could generate those here. Does anyone have a problem you're working on on-wiki? The tool you're writing, the problem you're working on, questions you want answered about articles? What models are we planning to make ourselves? We have a list of them; let's see if I can show you that. Oh, really? No? I think the harder question is what models we have not thought of yet, which is pretty hard to answer.

So: edit type is one we'll probably release pretty soon. You'll be able to run it, for example, on the history page of an article and label each edit with the type of change it was making. Was it adding new content, was it refactoring, was it just a copy edit, that kind of thing. Was it vandalism? Well, we already have that, right. That's something we figure will be useful both for analytics and for helping people navigate the wiki and what's going on in it.
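As a sketch of that what-if query: my understanding is that you list the model's feature values with a `features` parameter and override one by passing it back as a `feature.<name>=<value>` parameter. The exact feature name below is an assumption for illustration; copy the real keys from the first response.

```python
import requests

rev_id = 123456789  # placeholder revision
url = f"https://ores.wikimedia.org/v3/scores/enwiki/{rev_id}/articlequality"

# 1) Ask for the feature values the model actually used.
observed = requests.get(url, params={"features": "true"}).json()

# 2) Re-score as if the article had 10 <ref> tags. The feature name here
#    is illustrative; take the exact key from the output of step 1.
what_if = requests.get(url, params={
    "features": "true",
    "feature.wikitext.revision.ref_tags": 10,
}).json()
```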
Here's some background reading on the edit type model. And there's... you were working on a couple of others? Paid editing? Yeah, paid editing detection, thanks. That's a really fun one that should blow up in our faces. The idea there is that, as of a couple of weeks ago, we have a dataset which is a list of sockpuppet accounts that have probably been engaging in undisclosed paid editing, and we're doing a similar kind of analysis, except in this case I think we're primarily using grammar, because people seem to use a bunch of puffery when they're writing promotional material, and we can detect that in their inflated language. That one's not complete yet, but when it is, we should be able to do some type of abuse filtering, or hopefully someone will want to integrate it with a tool that helps find potential sockpuppets and potential paid editors so they can be investigated further.

Draft topic is really fun. I described it already, but let's see, I'll show you what it looks like so far. Not sure why it's taking this long... okay, there we go. Maybe we should go ahead and look at this diff, but the prediction is that it's in History and Society: Military history, and these are the probabilities that it falls into these other categories, or other WikiProjects, sorry. This particular list is actually what we're calling mid-level WikiProjects. It turns out there are a ton of WikiProjects, and a lot of them overlap in different ways, so Sumit Asthana, the person who created this model, defined a middle level in the WikiProject hierarchy that breaks things down into more logical categories. An article can still fall into multiple categories; it could easily be both Biography and Military history. Do you want to show us the article? This key map is killing me. Oh, you're... okay. So it happened to be right in this case; we like it when that happens.

Funny story: when we were creating this model there was a little miscommunication about what we were actually using as the raw material to score. The model was created using the very last version of each article, the most recent version, which is obviously the most complete one in most cases. When we retrained the model using the very first revision instead, we got almost exactly the same model, which is pretty cool, and it means we should be able to connect someone with an appropriate WikiProject right away. If anybody wants to work on that particular problem, that would be pretty great; please get in touch with us, because for all of these models we only have the resources to create the service that gives you the scores. We're relying on other people's interest and skills to use them in a workflow. We might have suggestions about how the workflow could look, and we will definitely have suggestions about how to use and interpret the scores coming out of our models.

Aaron, do you have any suggestions about generative questions here? I seem to be striking out. Yeah, let's look at some of the tools that use ORES, thanks. Actually, why don't you just go to the ORES page and look for the "tools that use ORES" section. You got it. Great. So these are some things that have been done already. There's integration inside of MediaWiki; that was the highlighting in Special:RecentChanges.
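Here is a sketch of querying the draft topic model through the same v3 endpoint; the `drafttopic` model name matches what the talk shows, the revision ID is a placeholder, and the exact topic label strings are illustrative.

```python
import requests

rev_id = 123456789  # placeholder; ideally the first revision of a new article
url = f"https://ores.wikimedia.org/v3/scores/enwiki/{rev_id}/drafttopic"

score = requests.get(url, timeout=30).json()
result = score["enwiki"]["scores"][str(rev_id)]["drafttopic"]["score"]

print(result["prediction"])  # e.g. ["History and Society.Military history"]

# Show the five most probable mid-level WikiProject categories.
top = sorted(result["probability"].items(), key=lambda kv: kv[1], reverse=True)
for topic, prob in top[:5]:
    print(f"{prob:.2f}  {topic}")
```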
It's also in Special:Contributions, and in article history, page history. Oh, watch this... sorry, it's in watchlists also. Huggle uses ORES predictions. Let's see if there's anything completely different here. PatruBOT... I don't need to do any shaming here, but PatruBOT is kind of an interesting story. This is a bot that was being used on Spanish Wikipedia. Any time you use one of these predictions, you need to be thoughtful about the cutoff level you're using to interpret the numbers we're giving you. You can't just use 50 percent, because the score is going to have a completely different meaning on each wiki, so you need to actually think about what the precision and recall are. If you don't know what those are, they're pretty easy to look up, but basically they're about the number of things you're catching and the number of false positives. That's a bad way to describe it; let me show you the picture.

You need to be thoughtful about the threshold you use as a cutoff, because that makes all the difference to what the results actually do. The person who wrote PatruBOT used some arbitrary number, and it ended up causing a huge number of perfectly good edits to be reverted, which is a very bad outcome. If anything, you want to err on the other side: you want to help people find as much vandalism as you can, up to the point where you start being annoying to people making good changes. So if you end up using ORES, please check our documentation about thresholds. We have a threshold calculation facility built into the models, so you can say things like "I want to know the threshold for catching 90 percent of vandalism". You can type that right into the URL, we'll give you the number, and you can use it. The number will change slowly over time, so you should go back and check it every once in a while.

I can show you two things to help understand that. This is a little diagram that helps explain precision and recall. Precision is basically how many of the things we're flagging are correctly flagged, versus how many are false positives. Recall is how many of the things you're looking for you actually caught at that threshold. And if you have access to the internet and run this, this is what a precision-recall curve looks like; this one is for the English Wikipedia damaging model. The x-axis is the threshold you use as the cutoff; the y-axis shows the precision, the recall, and the filter rate. Just looking at recall, it makes sense that the higher your threshold, the less of the data you're actually going to find, so recall shrinks as the threshold goes up. For this particular model, as you increase the threshold, the precision goes up, but that's because you're finding fewer and fewer of the actual items. Actually, I can't quite explain why it isn't centered on the border, but if you look at the graph you'll at least see it. Anyway, as you increase the threshold it seems to get more accurate, though I feel like that's not really important; you could graph accuracy too. Anyway, exactly: as you increase the threshold, you're essentially saying that you need to be more confident that an edit is damaging in order to flag it as damaging.
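To pin down the two terms, here is a small illustrative function computing precision and recall from scored edits at a given cutoff (made-up data, not the service's own evaluation code):

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of flagging edits whose score >= threshold.

    scores: model probabilities that each edit is damaging
    labels: True where a human judged the edit damaging
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(f and l for f, l in zip(flagged, labels))      # true positives
    fp = sum(f and not l for f, l in zip(flagged, labels))  # false positives
    fn = sum(not f and l for f, l in zip(flagged, labels))  # missed vandalism
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

print(precision_recall([0.9, 0.7, 0.4, 0.2], [True, False, True, False], 0.5))
# -> (0.5, 0.5): one of two flags is right, one of two damaging edits caught
```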
What you're essentially doing is reducing sensitivity as you increase the threshold; that's the term they like to use in signal processing. As you decrease sensitivity, the stuff you do catch is more likely to really be vandalism, so precision goes up: you're right more of the time when you say something is vandalism. But recall goes down: you're not catching as much stuff, because you need to be super duper confident about it. As you can see, as the threshold goes up from left to right, recall makes a steady decay; by the time the threshold is 0.5, we're only catching about half of the vandalism that's there.

You can actually query the API with a statement that looks just like this. You say: give me the best possible recall with a precision above some value, or, often more usefully, the best possible precision with a recall above some value. The precision constraint is useful because it bounds the number of good edits you end up bothering people about. It depends on your application, though; feel free to ask us for advice about that too.

For what it's worth, if you're looking at implementing this in one of your tools and you need to set one of these thresholds, it takes a little while to get used to the machine learning evaluation metrics that will help you pick the right one, and this is something we're pretty good at consulting on. For example, if you were setting up a counter-vandalism tool, I'd tell you that you probably want the threshold that gives 90 percent recall, and to surface everything it flags. You'll get low precision, but you want that high recall so you catch almost all of the vandalism. If you were building an automatic revert bot, where no human review is going to happen, I'd say set the precision really high, because mistakes are really costly there. So we set precision at 90 percent, and we might only catch 15 percent of the vandalism, which isn't a bad thing; we catch it really fast. These are the tradeoffs you make: in some cases you want high precision, in some cases you want high recall and it's okay to sacrifice precision. We can help you with that, and in fact the service can help you with it too: if you already know what sort of tradeoff you want, you can query the ORES service to ask where to set your threshold in order to achieve it.

Practically, if you go to the ORES page on mediawiki.org, you'll see a description of each of our models that is available and in use right now. This might be useful too: we have a table, updated dynamically, that tells you which wikis have which models available. Put this in the Etherpad. Any other ideas? Questions about things you'd like to see here?

Yeah, I think most of the main vandalism-fighting tools use ORES. The situation before ORES existed was that there were a bunch of AIs written by volunteers, and they would generally just target English Wikipedia or a small number of wikis, because each one has to be maintained by somebody. ORES was trying to solve that problem. If you want to see the specific tools in use, there's this URL. It also might be useful for looking at example code, but maybe not, because
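Here is a sketch of asking the service for a threshold given a target tradeoff, following the v3 `model_info` threshold syntax described in the ORES documentation; the exact quoting of the optimization string is the part to double-check against the live docs.

```python
import requests

url = "https://ores.wikimedia.org/v3/scores/enwiki/"
params = {
    "models": "damaging",
    # "Best recall you can get while keeping precision at or above 0.9."
    "model_info":
        'statistics.thresholds.true."maximum recall @ precision >= 0.9"',
}

info = requests.get(url, params=params, timeout=30).json()

# The response reports the threshold along with the precision, recall,
# and filter rate you would get by adopting it.
print(info["enwiki"]["models"]["damaging"])
```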
I think only a very small amount of the code in there is actually interfacing with ORES. It's probably more useful to just figure out the URLs you want to hit, look at the output, and see whether you understand it. Yeah, let's just switch to questions. Any more questions? On the ORES tool: well, we're here to show how easy it is; it's 30 lines long, and you'll be able to download it. So somebody think of a question to buy me time; I'll take the questions from the pad. Okay, thank you.

Reviewer judgment: yep, I think we answered that one. The differences between version 2 and version 3: the URL structure has changed slightly, and the biggest change is that we have the threshold calculation built in now. There will probably be a version 4, but we're trying to keep compatibility for as long as possible; you can still get to some of the v1 endpoints. When there's a version 4, it will just mean there's some kind of breaking change that we can't express in the same URL syntax, so you can probably continue using v2 and v3 for quite a while. And v1... yeah, I think a few things are starting to break. Oh, that shouldn't be; we should fix that. Okay, those are legitimate bugs: you should never expect v1 to change. If we ever do retire it, we'll have some sort of RFC where we try to see if we can get rid of it, but without an RFC, those are officially bugs. And of course we won't go backwards, so just target the latest API version, write your app that way, and you shouldn't have to worry about it for a very long time.

Any plans to enable the article quality indicator on other major Wikipedias? Yes, definitely. Let's go to the support matrix. Here's article quality: French Wikipedia has a model, and Persian Wikipedia is at 16% labeled; that's a labeling campaign, in case you're interested. I can't actually link straight to it, but the idea is just that you have people looking at specific revisions and typing in what they think about them manually. It looks like this... oh, right, there we go. So just keep track of this table, and if you want, say, edit type on your wiki, it's pretty easy to get started: just let us know and we'll help you start a labeling campaign.

Time for a demo? Okay, all right folks, let's see if I can manage this. I'm a little bit broken, but I think I can probably manage. Okay, so we just started playing around with releasing this draft topic model, and I want you to see a clear example of both how you can use it and how easy it is to build a tool on top of it. Maybe I'll even inspire you to build a tool during this session, or to take this very simple tool that I built while we were in this session and expand it into something people might actually want to use, as soon as my phone cooperates and lets me do my two-factor authentication. Any administrators in the room? Anybody a sysop on a wiki? Have you enabled two-factor authentication? Yes? Okay, you're good people, and you should feel good about yourselves. Okay, we do the temporary key... and it just updated. Let's see if it still works. Yeah, it still worked, but it won't work now. Okay.

All right, so first let me show you the script I made. Looks like there was a topical assessment from the main page; oh yeah, so as this script works, it assesses every page you're on. It's called "draft topic"... whoops, and how does typing work? And we will be able to do that, assuming I don't take too much time. So here's the
entire script. It's 30 lines of code. Basically what it does is reach out to ORES and get predicted categories for whatever page you're on. It does this by getting the current version of the page, constructing a request to ORES, figuring out the part of the page it can write its output to, and then making the call and writing the output. This is the actual call that reaches out to ORES, and this is the function that writes the output to the page. The only other thing it does is make sure the ORES prediction is above some minimum probability; otherwise it won't even show it to you.

Okay, so let's pick a page. I'm going to do... let's see if we can do Ann Bishop, the biologist. It runs, and here we go: it predicts that this article is about medicine, about chemistry, history and society... history and society twice is a little bit of a bug in the WikiProject hierarchy. Let me zoom in a bit; there we go. Geography: Europe; geography: countries; philosophy and religion; language and literature; and apparently it matches some maintenance templates. Now let's look at the first sentence. So: a biologist, Girton College, University of Cambridge. Obviously geography/Europe matched, chemistry was matching, biology was matching. She worked on the parasite responsible for blackhead disease, so obviously we're matching with medicine. This is a good example of what this tool can do.

You don't really need this for Ann Bishop, because her article has been around for a long time and is already relatively well categorized. But I want to show you something about Ann Bishop. If I look at which WikiProjects have marked this article as within their content space, you'll notice that WikiProject Medicine isn't here; neither is WikiProject Chemistry or WikiProject Biology. So this model is a great way to show WikiProjects which articles they haven't categorized yet.

One other thing I want to show you real quick, if we have time, is what I think is really the killer use case for this: reviewing article drafts. I'm going to go to English Wikipedia's Articles for Creation, and hopefully we can just go look at a draft. Yeah, let's look at Stephen G. Waxman. There we go. Without even reading the article, we already know this is about physics, medicine, chemistry, biology... see, "neurologist"; seems like it's spot on. So if you're a member of WikiProject Medicine and you want to review drafts of articles within your subject scope, here you go; ORES will back you up. Anybody want to work on this and make it render a little better? Maybe even make a tool that renders a big list of drafts for WikiProject Medicine or another WikiProject? I'd really like to talk to you about it. Thirty lines of code. Let me log out so nobody can use my admin account to do anything.

Adam and I will be around at the hackathon. Amir, back there... yeah, there he is. Amir is also on the Scoring Platform team; we're the team of people who maintain this prediction service. We're really interested in the tools you're developing, we're interested in any modeling ideas you have, and we're certainly interested in helping you use the models we already have. So reach out to us. Thank you.
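Since the gadget itself is JavaScript, here is a rough Python equivalent of its logic, so you can see how little is involved; the page title, the 0.3 cutoff, and the helper name are all illustrative, not the gadget's actual values.

```python
import requests

MIN_PROB = 0.3  # hypothetical cutoff, like the gadget's minimum probability

def predicted_topics(title, wiki="enwiki",
                     api="https://en.wikipedia.org/w/api.php"):
    # 1) Get the current revision ID of the page from the MediaWiki API.
    pages = requests.get(api, params={
        "action": "query", "titles": title,
        "prop": "revisions", "rvprop": "ids", "format": "json",
    }).json()["query"]["pages"]
    rev_id = next(iter(pages.values()))["revisions"][0]["revid"]

    # 2) Ask ORES for the draft topic score of that revision.
    score = requests.get(
        f"https://ores.wikimedia.org/v3/scores/{wiki}/{rev_id}/drafttopic"
    ).json()[wiki]["scores"][str(rev_id)]["drafttopic"]["score"]

    # 3) Keep only topics above the minimum probability, like the gadget.
    return {t: p for t, p in score["probability"].items() if p >= MIN_PROB}

print(predicted_topics("Ann Bishop (biologist)"))
```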