It's time to begin our monthly metrics and activities meeting for September 2015. My name is Praveena Maharaj and I will be your host today. We'll begin our meeting by welcoming recent hires as well as recognizing staff anniversaries. We'll have a community update from Luis. Dan Garry will present top-level metrics from the Discovery team. Leila will join us from the Research team to talk about the article recommendation experiment. And Dan Duvall will join us from the Release Engineering team for our feature presentation on voting browser tests. Then we'll open it up for Q&A.

These are the names of our new hires. Please give them a warm round of applause welcoming them on board. Congratulations to everyone on this list; we look forward to working with you.

Staff anniversaries. We have Mark Bergsma at nine years, Erik Zachte at seven years, Michelle Paulson at seven years, Santhosh Thottingal at four years, Chris Johnson at four years, and Dan Andreescu at three years. I won't recite everyone's name on this list, but to everyone who's celebrating an anniversary: thank you very much for your dedication and for your contributions towards achieving our mission. Can we please have a round of applause and celebration for staff anniversaries? And now I'll pass it to Luis for a community update.

It's summertime, and I need the slide clicker. Because it is summertime, this is going to be a somewhat light update this month. I've been surprised at how many people seem to take a lull; that was an interesting learning for me this month. The first thing I want to call out today is a documentation directory that has shown up. How many of you have ever used the phrase "it's on the wiki" and then laughed, or cried? Yeah. That happens, obviously, not just internally but outside as well. One of the things that people are often looking for in the broader community is outreach materials: examples of use cases, materials that can help build a pitch, things like that. So John Cummings suggested in IdeaLab: hey, what if we built a directory of this stuff to help people find it? What I think is cool about this, and why I wanted to call it out, is not just that somebody said, "hey, we've got this problem, it's hard for us to find this stuff." It's that Hay Kranen (and I apologize if I mispronounce the name) from the MediaWiki tools hacking community said, "hey, I've got some code around that we can reuse for this." It's exactly the kind of example of people coming together with a problem and a solution and hacking it up. It just works; it's a neat little tool. Nothing super fancy, and many of you won't need it in your day-to-day, but I think it's a great story of problems and solutions coming together in our community.

I wanted to share a quick update about VisualEditor, because I just loved this quote: "VisualEditor made it so much easier for me to edit. I discovered that I love editing Wikipedia." This is somebody who made a little over 600 edits in the last month (609, I think), all of them with VisualEditor. So for those of you on the VisualEditor team, a big pat on the back.

And a final note: the tech community outreach program, which I've mentioned in past meetings, is now complete, and 10 projects got through.
So thank you to all of those folks who worked with the tech community over the summer, both those of you who were creating projects and those of you who were supporting them as mentors. It's a really important pipeline into that community. So thank you, everybody. I've got too much stuff in my hands, so I can't clap myself; I apologize.

I don't normally do this, but I also wanted to call out a couple of conversations that are going on in the community right now. The first one is about the technical spaces code of conduct. As some of you will have seen, especially those of you who participate in MediaWiki development, the MediaWiki community, in a conversation that started at Wikimania, has decided to go forward with drafting a code of conduct to govern all of the code development spaces in the project together under one umbrella. It starts with, "As contributors and maintainers of Wikimedia technical projects, and in the interest of fostering an open and welcoming community, we pledge," and then goes on into all of the details. It's still a draft, by the way, so you can follow the link when we post the slides and participate. I think this is an important conversation for anybody who participates in our technical spaces, and I would urge you to take part.

The other conversation going on right now that I wanted to flag is our consultation on reimagining grants. The goal is to better support people and ideas in the Wikimedia movement; grants are how we resource the broad variety of projects, programs, travel, events, and organizations in our movement. The consultation is on Meta, and there are only a few days left. For those of you within this room, this is handled by the Resources team, so you might go pick Siko's brain; and if you are not an employee of WMF, I totally urge you to go take a look at the consultation. It also tries out some cool new approaches to how we do consultations, and that's something the Community Engagement department will be talking up here about some morning soon, I'm sure. So on that note, I will hand it off to Dan Garry. Thank you.

Hi, I'm Dan, lead product manager in the Discovery department, and I will be talking to you about our top-line metrics. So what is the Discovery department working on? We've got four major projects in flight right now. The first is search; I'm sure you're all familiar with search. The goal there is to make our content searching systems better across all of our wikis. We're also working on the Wikidata Query Service, which is meant to be a general-purpose tool that lets people run arbitrary queries on the data in Wikidata. If you've ever wanted to generate lists from Wikidata according to arbitrary criteria (maybe you want a list of prime ministers of the United Kingdom that were born before a certain date and had a sister called something), you can build these queries. That's the intent of the Wikidata Query Service.
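To make that concrete, here is a minimal sketch of the kind of arbitrary query the Wikidata Query Service is meant to support, run from Python against the public SPARQL endpoint. The property and item IDs used here (P39 "position held", Q14211 "Prime Minister of the United Kingdom", P569 "date of birth") and the 1900 cutoff are illustrative assumptions, not an example taken from the talk.

```python
import requests

# UK prime ministers born before 1900, via the Wikidata Query Service.
SPARQL = """
SELECT ?pm ?pmLabel ?born WHERE {
  ?pm wdt:P39 wd:Q14211 ;      # position held: Prime Minister of the UK
      wdt:P569 ?born .         # date of birth
  FILTER (?born < "1900-01-01T00:00:00Z"^^xsd:dateTime)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?born
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "wdqs-example/0.1 (metrics-meeting demo)"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["born"]["value"][:10], row["pmLabel"]["value"])
```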
We also have the maps tile service, where we're generating map tiles that can be used to back map-based features. Think about all of the possibilities we could unlock by having dynamic maps on Wikipedia articles, on Wikivoyage articles, across all of our content and all of our projects; that's what the maps tile service is for. And the Analysis team is building an understanding of how people use search and what they need, because we support a lot of different use cases with search, and it requires a lot of analysis to figure out what we're achieving, what people want, and how people are using it. Some people reach out and tell us how they're using it, obviously, but there are a lot of people who don't reach out, and we want to be able to understand them too.

So what are our goals? I've provided a link on the slides to our public goals, and I'll list them here. Our goal for search is to cut the zero results rate in half. That's the rate at which people get nothing when they type in a search query, and we want to really reduce it. For the Wikidata Query Service and the maps tile service, the goals are identical: deploy the service as a beta, monitor usage, and collect user feedback to decide what we do next. We need to figure out how people are using it, what's missing, and what they need before we can decide how much more investment to make. And for analysis, our goal is to understand how relevant the results we serve to our users are. Are we giving them the right thing? Obviously, for search, right now we're focusing on the zero results rate because it's binary: we know whether we've given them something. It's much harder to know whether we've given them the right thing.

So what are our key performance indicators? For those of you who aren't familiar with the term, KPIs are top-line metrics that we use to evaluate the impact and the success of the work that we're doing. We have four. First, user satisfaction: as I mentioned, users should be getting relevant results, and that relevancy is what we want to measure. Second, user-perceived load time: searching should be really fast, because if it's really fast, people will like using it. It's important to note that we're measuring user-perceived load time. We're not measuring how long the service took to generate results, or how long those results took to render; we're measuring how long it took from when the user typed the query to when they got the results. That includes things we can't control, but at least we understand how users are experiencing what we're doing. Third, the zero results rate: if we give users nothing, then they've not found what they were looking for. And fourth, API usage: how much third parties, other teams in the Foundation, and we ourselves in the Discovery department are using the API, as a way of measuring how many people are building experiences on top of our search.

So I'm going to go through some of the KPIs. These slides are from our dashboard, which is at searchdata.wmflabs.org, so it's all out in the open. Load times over time: you can see we've only got data for a couple of months. Generally, load times are pretty stable; they turn upwards and downwards, and you can see something happened in the middle of June that caused a decrease on most platforms. We're still trying to build an understanding of what exactly that means. Since we're focusing on the zero results rate, we understand it a lot better than our other KPIs right now; when we focus more on these, we'll build that understanding. And API calls over time: again, you can see a dip at the end of June, but since we've only got a few months of data, that could just be seasonality.
We know that traffic changes over time (it goes down some months and back up in others), and we don't yet understand whether that's just seasonality or whether it's indicative of something else, again because we've been focusing on the zero results rate.

So I've been going on about the zero results rate a lot. Why do we care about it? Well, we want to give people relevant results. If someone's searching, we should give them something relevant to the query they typed in. And if we've given them nothing, we've not given them anything relevant. Or have we? I'll elaborate on that.

So what have we actually done to try and reduce the zero results rate? The first thing: if the user gets zero results but gets a search suggestion, we just run the suggestion instead. Elasticsearch, the system that backs our search, is actually pretty good at catching errors like typos and providing a suggestion. In this case, I typed in "Manchester," but I typed it wrong: I put two Rs in. Previously, it would have given you zero results and asked, "did you mean to search for Manchester?" Instead, we now just do it for them. We could look at whether, even when people do get results, we should give them the suggested query instead, but we're not focusing on that right now; we're just focusing on zero results.

Something else we've done is run A/B tests to figure out if there are better search parameters to use. There's a link here to the analysis we've done on the A/B tests. So we're changing parameters; what does that mean? It alters the way that search provides suggestions to users. In this case, we changed the confidence threshold, which means that instead of the suggester having to be really, really sure that a suggestion was better than what you were getting, it only had to be sure that what it was going to give you was as good as what you had. We thought that if that produced more suggestions, maybe it would reduce the zero results rate. We ran an A/B test and found that although the results were statistically significant, the actual magnitude of the change was so minuscule that we basically didn't achieve anything towards our goal.
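As a hedged sketch of the two search-side changes just described (query forwarding, and a looser confidence threshold on the phrase suggester), here is roughly what they look like against Elasticsearch from Python. The index name, field names, and the exact confidence value are illustrative assumptions, not the production configuration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

def search_with_forwarding(query, depth=0, index="enwiki"):
    body = {
        "query": {"match": {"text": query}},
        "suggest": {
            "did_you_mean": {
                "text": query,
                "phrase": {
                    "field": "suggest",
                    # confidence 1.0 only requires a suggestion to score as
                    # well as the input; higher values demand it score better.
                    "confidence": 1.0,
                },
            }
        },
    }
    result = es.search(index=index, body=body)
    hits = result["hits"]["hits"]
    options = result["suggest"]["did_you_mean"][0]["options"]
    if not hits and options and depth < 1:
        # Query forwarding: rather than showing "did you mean ...?" next to
        # zero results, rerun the search with the top suggestion.
        return search_with_forwarding(options[0]["text"], depth + 1, index)
    return hits
```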
We've also, as I mentioned, done a lot of analysis and figured out where a lot of these zero-results queries are coming from; here's a link to the analysis, which is public on wiki. We've actually fixed many of these. Some of them were bugs in our apps or in third-party systems. And some of them are intentional. I said earlier that when we're giving people nothing, we're failing them; our assumption going in was that we definitely are. But as it turns out, there is a multitude of queries for which giving someone nothing is the correct answer. For example, DOIs. For people who don't know, a DOI is like an ISBN for scientific papers, and there's an app that lets you put in a DOI and tells you whether it's cited in Wikipedia. Almost all DOIs aren't cited in Wikipedia, which means that almost all of the time, giving them nothing is correct. And that can be up to 20% of the queries that get zero results. So when we set our goal to cut the rate in half, we didn't know that 20% of the people getting zero results are correctly getting zero results. We've definitely started to understand that a lot better. As another example, the "article title and title of link taken from article" category included one person who was just looking up Dutch dictionary terms and not finding them, so we pointed him to our page dumps.

So how is the zero results rate looking? I've told you a lot about what we've done; what's the impact? This graph shows that we actually haven't had an impact so far towards achieving our goal with the changes we've made. Our query forwarding and our A/B tests have not shown anything. They've basically shown that the default parameters Elasticsearch ships with are pretty good; the Elasticsearch folks know what they're doing, which is nice, but it means we're not going to get any quick wins there.

So given that this is our goal, and we want to achieve it by the end of the quarter, which is the end of this month, what are we going to do? Well, we need to try something a little more radical and see if we can make a change that way. Why don't we just generate search results in a completely different way and see what happens? This is the Elasticsearch completion suggester. We've set up a demo at suggesty.wmflabs.org/suggest.html if you want to try it out. Basically, it uses something built into Elasticsearch that can generate suggestions for what you should search for, based on the query you've typed in. In this case, we're comparing it to prefix search, which is the system behind the box at the top right on Wikipedia. The main reason to use prefix search is that it's really, really fast; a prefix search is computationally very cheap. The downside is that it's susceptible to typos: make a slight typo and you're not going to get a result. As a consequence, Wikipedia has evolved a massive system of manually curated redirects, where users have to anticipate the typos people might make and manually create a redirect to take you to the right place. As you can see with this example, where I've misspelt the title, the completion suggester doesn't rely on those redirects to take you to the right place. That means that if this actually works, we could free up a lot of time for the people who are manually curating redirects to do other things instead, since it's a problem we might be able to solve in software. And in this case, it was actually faster than prefix search.
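For the curious, here is a rough sketch of what querying the completion suggester looks like from Python, using current Elasticsearch query syntax. The index and field names are assumptions for illustration; the fuzzy option is what lets it tolerate the typos that prefix search needs manually curated redirects for.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

def complete(prefix, index="enwiki_titles"):
    body = {
        "suggest": {
            "title_suggest": {
                "prefix": prefix,
                "completion": {
                    "field": "suggest",
                    "size": 10,
                    # Unlike plain prefix matching, allow small typos.
                    "fuzzy": {"fuzziness": "AUTO"},
                },
            }
        }
    }
    result = es.search(index=index, body=body)
    return [opt["text"]
            for opt in result["suggest"]["title_suggest"][0]["options"]]

# e.g. complete("Marrchester") can still surface "Manchester"
# without a manually curated redirect.
```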
But that's just one test. How are we actually going to know whether it's better or not, and whether it's a viable alternative to prefix search? As I mentioned, we support a lot of different use cases, and it's impossible to know in advance whether something will be better or worse. So we tested it: we took the queries that were getting zero results and reran them against the completion suggester instead, and it cut the zero results rate by 40%. That's very promising, but it's just a test; we need to test it in the wild. So what are we going to do about that? Well, we've deployed the API to production. Don't panic, though: all that means is that the API is there. We're not actually using it for anything right now. What we're going to do is an A/B test: we're going to give a very, very small proportion of users on the English Wikipedia the suggestions from the completion suggester instead, and compare them to prefix search.

We would have liked to do this for all wikis and not just the English Wikipedia, but the API is very temporary, and we have to build the search indices, and the search indices take up disk space, so we restricted it. We built the API for English and German, and we're only going to test on the English Wikipedia. We're going to start that test next week. It'll run for two weeks, and hopefully it will tell us whether users are getting fewer zero-results queries. We don't have our search satisfaction metric yet (as I mentioned, we're still working on that), so this test won't tell us for sure whether we should replace prefix search with the completion suggester, but it will tell us whether we're going along the right lines. And that's the end of my segment. I'd like to thank everyone in the Discovery department for all of the work they've done on this. You'll hear from us soon. Thank you. I will pass over to Leila, who will be talking about research.

Hi everyone. I'll be talking for the next 15 minutes. I just want to point out that what I'll be presenting is the result of the work of 31 people plus me, so 32. I'll be doing the talking, but there are a lot of people behind this, so feel free to reach out to them; the meta page should direct you to some of them, and their names are at the end of the slides. I'll be talking today about increasing article coverage in Wikipedia, and more specifically about an experiment we ran in June, which we now call the article recommendation experiment.

I want to start by convincing you that we have a problem: a content coverage problem. Let's look at this map together. This is the map of the world in the eyes of English Wikipedia. Every blue dot on this map represents one article on English Wikipedia about that location, and the brighter the color gets, going to orange and white, the more articles there are about that location on English Wikipedia. On top you see how many articles are available (950,000 articles about different locations in the world on English Wikipedia) and the number of native speakers of the language (500 million plus for English). I should point out that the orange points are still fewer than 10 articles, and white is 30-plus articles. So what you see is that if the only language you speak is English, this is how much you can learn about different locations in the world using English Wikipedia. A good chunk of South America and Africa is pitch dark.

So what happens if you switch to other languages? Suppose you want to learn about Eastern Europe, so you learn Russian. If you learn Russian, you will learn much more about Eastern Europe than you would from English. The problem is that if the only language you speak is Russian, the world is even darker. And the situation doesn't really get better in other languages, as we might have expected. This is the case for Spanish: South America is covered better, but of course Brazil is a big part of it, and Portuguese is not much better. And when you go to Arabic, it gets worse. So what I'm hoping to convey is that we have a content problem: we're offering a very limited amount of content, at the very least to users who may be looking for it. There are different ways you can look at this problem.
One way is by looking at Wikidata. On the x-axis here, you see that there are close to 10 million Wikidata items that correspond to Wikipedia articles, and those items are represented in basically one language. Of course, we all know that not every item should be represented in all languages, but this should give you a sense. For example, two million items are represented in only two languages. So that's another way of looking at the content coverage problem across languages.

The last way we can look at this is as a supply and demand problem. There are close to 2,500 languages in the world. Today, including all the inactive and active language projects, we are covering 287 of them. More than 50% of the world's population accessing our content is monolingual. And the next billion users, coming online in the next five years, are coming from languages where we have even less coverage. Then there's the supply that we, as a movement, are providing to those users. Articles are created at a rate of 6,500 per day; forget about quality and whether they get deleted and all that, this is simply the creation rate. 70,000 active editors contribute to our projects, and that number has been pretty steady, even though we expect demand to increase as more people are added to the pool of our users and readers. And we have 14,000 new accounts created every month.

So, the last slide on convincing you. This is roughly the amount of work you would have to put into our projects to take the current 287 languages and make sure each of them has 40,000 articles; that's the number of articles available in Encyclopaedia Britannica. Let's say every language we already serve should have at least that many articles. We would need at least three years to get there. And this is a very simplistic way of counting: it assumes the entire current editor pool is willing to switch and write the articles we think they should be writing, which is not the case, and also that they are able to; we know we don't have that many people in all the languages we want to write in. If you want to double the size of Wikipedia, we need 12 years.

So there are different ways of looking at this problem, and we're going to propose one approach, which we call article recommendation. If you look at the state of knowledge in language X at this point, and we don't do anything, and then you look at the state of knowledge in that language at a point in the future, this black band is how much content will be added to the language. What we are hoping to do is increase the speed at which content is added, by looking at content across languages, comparing them, finding what is missing, ranking it in terms of importance, and then finding people who would be interested in contributing that content. So that's the goal. The methodology is simple in words: we want to identify important missing content in language X. To do this, we choose a language pair, for example English as the source and French as the destination, and we find all the articles that are available in English but missing in French. This is how we find missing content.
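As an aside, here is a toy sketch of that missing-content step, under the assumption that Wikidata sitelinks are used to compare language editions. The wbgetentities API call is real; the batching, error handling, and the scale of the real computation (millions of articles) are simplified away.

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def missing_in_french(english_titles):
    """Return English titles whose Wikidata item has no French sitelink."""
    resp = requests.get(API, params={
        "action": "wbgetentities",
        "sites": "enwiki",
        "titles": "|".join(english_titles),  # up to 50 titles per request
        "props": "sitelinks",
        "format": "json",
    }).json()
    missing = []
    for entity in resp.get("entities", {}).values():
        links = entity.get("sitelinks", {})
        if "enwiki" in links and "frwiki" not in links:
            missing.append(links["enwiki"]["title"])
    return missing
```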
Then we sort those articles in terms of their importance, and then we go to users: we find potential participants (in the case of the experiment, we used Babel templates and users' edit histories) and recommend missing content to them. The way we do this is we look at the user's contributions, their edit history, and we build a topic model on top of it. From this, we learn what these users are interested in contributing to. Then we look at the missing articles and build a topic model of those as well, and we match articles to users based on the users' interests.

This is a little complicated if it's the first time you're hearing it, so I'll show you one example; hopefully it makes things slightly easier. This is one real user, one of our colleagues. Editor E's last 15 edits on English Wikipedia have been on these topics. We've looked at the articles this user has edited and fed them to the model, and these are the topics you can think of this user as having contributed to. As you can see, the user is pretty interested in natural and unnatural disasters. Then we look at what is missing in French Wikipedia but available in English, and we recommend missing articles to that user based on their edit history. What you see here are the articles this user would receive as recommendations to contribute to, and most of them are focused on natural disasters. This is one person, and the person is also known to be working on early warning systems for tsunamis. So that's the idea: you look at the edit history, you look at the missing content, and then you recommend.
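To show the shape of that matching computation, here is an illustrative sketch using a simple bag-of-words similarity in place of the topic model the talk describes; the corpus, the vectorizer, and the choice of five recommendations are stand-in assumptions, not the production pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(user_edited_texts, missing_article_texts, top_n=5):
    """Rank missing articles by similarity to a user's edit history."""
    vectorizer = TfidfVectorizer(stop_words="english")
    article_vecs = vectorizer.fit_transform(missing_article_texts)
    # Represent the user as the concatenation of the articles they edited.
    user_vec = vectorizer.transform([" ".join(user_edited_texts)])
    scores = cosine_similarity(user_vec, article_vecs).ravel()
    best = scores.argsort()[::-1][:top_n]
    return [(missing_article_texts[i][:40], float(scores[i])) for i in best]
```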
So let me tell you about the experiment we ran in June. This was an experiment from English to French. We looked at all the articles available in English but missing in French: in total, 3.6 million articles. We took the top 300,000 and divided them into three groups; these are the three groups you see here. The remaining 3.3 million we just kept out. Then we looked at the potential users we could contact for this experiment; there were 12,000, and we divided them into two groups. We took the first article group and made personalized recommendations, based on each editor's contribution history, to the first group of users. We made random recommendations to the second group. And we kept the other two article groups unrecommended for other purposes of the experiment, so that we could say later whether our recommendations made a difference or not.

These are some descriptive statistics. What you see is that 238 people out of the 6,000 in the first group started a translation. There were 290 articles created by these users; 123 got published, and 8 got deleted. And then there's the ratio of female to male participants, which is something we keep an eye on. You see a difference here, but it is not significant; we'd still like to report it. In the random group, only one of the people who responded was female, out of the roughly 6,000 people we reached out to. So this is something to keep an eye on.

There were two important questions we wanted to answer. The first is: does personalization matter? There's always the argument that you can recommend something to users without personalizing it, and there are different reasons for that; one is that personalization is computationally expensive, so if users would do just as well with random recommendations, we don't want to go down that path. The answer is yes, personalization matters. We ran a hypothesis test comparing the personalized and random groups, and what we saw is that, on average, personalized recommendations boost the probability of activation by 82%. If you are in the personalized group, you are 82% more likely to do something with the recommendations you receive.
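As a back-of-the-envelope sketch, assuming a simple two-proportion test, here is the shape of that comparison. Only the 238-of-6,000 personalized figure and the two group sizes come from the talk; the random group's count below is a made-up number back-derived from the reported 82% lift.

```python
from statsmodels.stats.proportion import proportions_ztest

activated = [238, 131]    # personalized vs. random; 131 is an assumed count
reached = [6000, 6000]    # 12,000 users split into two groups

z_stat, p_value = proportions_ztest(activated, reached)
lift = (activated[0] / reached[0]) / (activated[1] / reached[1]) - 1
print(f"z = {z_stat:.2f}, p = {p_value:.4f}, lift = {lift:.0%}")  # lift of about 82%
```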
The next big question was: can we increase the article creation rate? This is important to answer because, no matter what we do, if we can't increase the rate of creation, we may as well leave the system where it is. The answer to this one is also yes. We first compared the random article group with the held-back group; the held-back group is the set of articles not recommended to anyone, versus the set recommended at random to users. What we learned is that, on average, random recommendations boost the article creation rate by 78%. So even without any personalization, just taking important articles and randomly recommending them already increases the article creation rate by that much, which is great. Then we were curious what would happen in the personalized case, and what we see is that with personalization you can increase the article creation rate by 220%. This is great; we're very excited about it, and we definitely want to take it from here and do many cool things with it, with you.

Other important findings; there are a ton of other things we learned. Articles that are predicted to be more widely read are actually more likely to be created. Articles created as a result of the recommendations are viewed more; this matters to us, because if you recommend that someone create something nobody reads, there's a question there, right? Editors who were more active prior to the experiment were also more likely to respond. This was important to understand because there are always discussions about whether, if we start sending encouragement, we can bring lapsed people back to Wikipedia. This experiment shows that, at least here, people who have been around continued to participate. We did receive some emails from people saying, "nice to get this email, I haven't contributed for some time and it was cool to see," but the data really does not support that; on the whole, it seems the people we lose, we lose, at least in this experiment. Editors who made at least one medium-sized edit in both languages were the most likely to respond, which is also something we expected. If you make a lot of long-form contributions in both languages, you already know what you're doing, so you don't really need our help or ideas about what to contribute to next; and if you edit very little, you're either not editing much or you're a different kind of contributor, and as a result you're not going to interact much with the recommendations. So I'm going to skip the summaries.

The two main points I want to call out here again are that personalization matters for article creation, and that we can actually increase the article creation rate.

Next steps on our end: we started building an instance on Labs in Mexico City when we were there for Wikimania, and there are now 10 language pairs offered on that instance. You can go there; for example, I went and put in "curiosity" as an article I'm interested in in English. I chose English as the source language and Persian as the destination, and it gives me a list of articles closely related to curiosity: either the article itself (actually, I was surprised to learn that this article doesn't exist in Persian) or articles close to the topic that I could contribute to and that are missing in Persian. Ten language pairs; you can check it out. There are a couple of things we want to try next. One is going to edit-a-thons: at Wikimania, a couple of people came and talked to us, and now that we see there's a lot of potential in this kind of recommendation, we want to go out, talk to people about it, and encourage them to use it; it seems to be working. We want to build a feedback loop into the algorithm, so that if users think a topic is not interesting or not important in their language, they can feed that back to us. And then we want to make some more improvements to the Labs instance, and have the recommendations become part of the content translation tool. With that, I'll close. Thank you.

Thanks, Leila. I'm Dan Duvall from the Release Engineering team, and I'm going to take this opportunity to tell you a little bit about our work on facilitating voting browser tests in the development process. So first off, Release Engineering: what do we do, exactly? I've gotten that question a few times since I started here about a year and a half ago, and we're still a relatively new team, so I thought I'd give you a brief overview and lead into the actual meat of the subject.

This is a rough representation of an iterative development and release process, which I think most teams at WMF are practicing in one form or another; whether it's agile with a capital A or not doesn't really matter. We typically pull well-defined bits of work off of a backlog, which is curated through the planning process. We iterate on the implementation of that work, all the while getting continuous feedback from our automated systems and from each other. And once things are deemed to be working correctly, we merge those changes and then, hopefully safely, release them to a production environment. Release Engineering takes care of the feedback and release process, or at least facilitates it: we provide tools and infrastructure for continuous feedback during development, and tools to help with safe deployment to production.

There are a lot of different kinds of feedback that we facilitate, everything from iteration planning and release planning through Phabricator, to code review through Gerrit (maybe Phabricator one day). And then there are the more automated forms of feedback, which generally means software testing. There are a lot of different kinds of software tests, starting with unit tests, which look at the most basic constituent elements of your software at a very exacting level.
There are integration tests, which test the interoperation between those elements. And higher up, there are these things we've always referred to as browser tests, though in testing parlance I like to think of them as end-to-end browser tests, in that they test the entire software stack, from its most outward interface through the back end, down to the infrastructure and back. The reason I think we call them browser tests is just that we work primarily on websites and web applications, so we tend to drive these tests using a browser; but I prefer "end-to-end tests" because it better describes their primary use.

You can see here the form that we typically author these tests in; it's a slightly awkward natural language. We have a feature, and for each feature we have scenarios, where we try to focus in on one particular aspect, one behavior, of our software. We write these with three clauses. The first clause sets the precondition; in this example, "Given I am on the login page." The next is the "when" clause, the action performed; in this case, "When I provide good credentials and I click the login button." And the third measures the expected outcome: if I'm on the login page and I give good credentials, then I should have been logged into the system.
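To make that concrete, here is a minimal sketch in Python with Selenium of what that login scenario boils down to when a browser drives it, with each clause marked as a comment. The element IDs and credentials are assumptions for illustration; the real suites are authored in the natural-language form shown on the slide, not directly like this.

```python
from selenium import webdriver

driver = webdriver.Firefox()
try:
    # Given I am on the login page
    driver.get("https://en.wikipedia.org/wiki/Special:UserLogin")

    # When I provide good credentials and click the login button
    driver.find_element_by_id("wpName1").send_keys("ExampleUser")
    driver.find_element_by_id("wpPassword1").send_keys("correct-password")
    driver.find_element_by_id("wpLoginAttempt").click()

    # Then I should be logged in
    assert "ExampleUser" in driver.page_source
finally:
    driver.quit()
```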
So, these browser tests: why do we use them? Well, like I said, they test everything in the software stack, which has its ups and downs; there are definitely some downsides, which I'll talk about towards the end of the presentation. But they do find bugs. I put "citation needed" on the slide, mainly because I was too lazy to dig through Gerrit and Phabricator to find proof, but just a couple of days ago there was a MobileFrontend browser test that may have (we're not totally sure yet) exposed a regression in ResourceLoader when serving the non-JavaScript version of the website. So they are useful in that sense.

But up until now, we've been running these daily, which is problematic, because a lot changes over the course of a day. If you only run your tests at the end of that cycle and you get a failure back, it's hard to correlate it to a specific change you made in your software. So how do we tighten that feedback loop, so to speak? How do we get feedback for every change that we push up to our code review system? I've been presenting this as our initiative, but really, since I came on last year, a lot of people have been asking for this, and we just thought these tests were a little too fragile and a little too resource-intensive to run for every change people push up. But a couple of people on the Reading team and on the Search team set up their own bots on Labs instances (one was named Barry, the other Cindy) that would listen for changes in our code review system, perform these tests, and report back. We were really inspired by this. It did the initial experimentation for us and showed that a well-groomed set of these tests can be reliable enough, and useful enough, to run in this context. And that inspired us to move forward with actually normalizing it.

There were a few drawbacks that we saw, and we wanted to address them when taking that proof of concept and normalizing it. It wasn't tremendously difficult to set up, but it was a little more difficult than we wanted, and these tests were running outside the context of our normal feedback cycle, so we wanted to get them into continuous integration, the system we use for providing this type of feedback. We wanted to provide this functionality, running browser tests this way, for all teams within WMF and for volunteer developers as well. The answer was taking what Reading and Search had done with those bots and normalizing it as a Jenkins job. Jenkins is one of several parts of our continuous integration system; it takes changes to our software and runs jobs against those changes.

What the job does, in brief, is this: it takes the change (right now we're just doing this for extensions), along with the most recent version of MediaWiki core and the dependencies for that extension, sets up a local wiki, executes the end-to-end browser test suite against that wiki using real browsers, and records the sessions, so that if a failure occurs, it can show the developer the recording and give them a better sense of where in the process it failed. For a lot of this I want to say thanks; I don't know if I'm supposed to call out names precisely, but Timo and Kunal helped me a lot in generalizing the implementation, because much of this was already there for running JavaScript unit tests. So thanks to them.

These tests are slow, of course, because using real browsers to drive the process and approximate what a real person does is inherently slow. But if you have a small, well-defined suite of browser tests, it's not prohibitive. And it gives you useful feedback: you can see here that one particular scenario failed, and it gives you a link to a video of the session. Most importantly, it performs these tests on every commit pushed up to our code review system. And even more importantly, which isn't actually represented here, the failures are voting: they prevent bad code from getting merged and eventually deployed to production environments.

I want to end the presentation with a little general advice about testing. I don't think we should replace our unit tests with these types of tests, because while they're very useful, they are inherently fragile. They give you a broad set of coverage, but we should still focus first on unit testing, which is a very stable, very precise, very exacting type of testing, then integration testing, and finally these end-to-end browser tests on top of that. And that's it, thanks.

Okay, it's time for our Q&A session. Mr. Forrester, do we have any questions from IRC? Hello, yes. We have quite a few questions from IRC. Shall we start with Dan Garry? First question (other people can answer too, of course), from Pine: can someone describe how the strategy update, by which he means the strategy consultation update presented at a previous metrics meeting, affects the work happening in the Discovery department? How does the strategy consultation affect the work that we're doing? Well, the strategy consultation informs us a lot about what people want from our sites.
For example, people want a lot more dynamic content on our sites. That's part of the reason for our investment in maps: people were requesting it. Some people requested maps explicitly; some just said they wanted more dynamic content. Then there are other projects informed by other sources; for example, the Wikidata Query Service is an obvious extension of Wikidata, letting people access the data there. So we look at the strategy consultation and decide what our strategy and our roadmap are based on that, and based on presentations that Lila gives that tell us about the strategy as well. So that's the Discovery aspect.

And another question: what's the timeline for autocompletion being in production, so that, in particular, it could be used by the apps? That depends. The API is there right now; if the mobile apps teams want to run a test, speak to me, and you can absolutely run one. As I mentioned, though, we're not confident it can fully replace prefix search yet, so I wouldn't recommend doing that at all. We need to run our A/B test, which will tell us whether it affects the zero results rate, and we also need to make sure it's serving relevant suggestions. But I would absolutely love to work with the mobile apps team to add it as an A/B test in the apps. I can't give a timeline right now; it depends on the results of the A/B test, and it may be that we throw it away because we find out it doesn't even do what we hope it will.

Any more questions for our presenters today? James. Hi, now I've got a bunch of questions for Leila. Yes. Oh boy. Okay, first one, from Erin: how do you make estimates about the amount of time it would take to get to that level of coverage? So, you mentioned how much effort would go into coverage. We looked at the fact that 6,500 articles are created every day, which is the creation rate, and then we looked at every language edition of Wikipedia, based on the Wikipedia statistics site: how many articles does it have right now, and what is the difference between that number and 40,000? Based on that, we made the estimates of three years or twelve years.

I've got a sort of follow-up: might directing editors' attention to increasing coverage take their attention away from other activities? How do we know this is a better use of Wikimedians' time than what they would do otherwise? Right, this is a good question. Let me answer in a couple of stages. One thing that was promising for me to see is that we are increasing the creation rate of content. It is true that for the experiment itself we only contacted current Wikipedia editors, but the fact that the rate increased means we are not just diverting attention from one place to another; we are actually increasing the rate of creation, which is promising on its own. There is definitely the concern that people know what they want to edit, so why should we divert their attention elsewhere? But this is only a recommendation; it's obviously the editor's choice whether they want to contribute to that content or not. And lastly, the experiment itself was on current editors, but we are definitely looking into reaching out to more potential editors and bringing them into Wikipedia.
That is obviously down the line; we have to work with different communities to make sure we are doing things in the right direction. But this is for new users as well.

Hello. All right, thank you. A comment on Luis's point about the code of conduct discussion: I wanted to let people know that there is going to be a private email address set up, so that people who are, for whatever reason, not comfortable commenting in public will still be able to contribute to the discussion and help build consensus. So thank you, Luis, for bringing that up. That's great to know. Will that email address be posted on the current consultations page or the discussion page? It will be posted, and it will also be emailed out to Wikitech. Great, thanks for sharing that.

Yeah, this isn't really a question; well, I guess it's a question for the C-level staff. While we're struggling to get people to create articles in under-supported languages, Magnus Manske has built tools that create a decent description in every language in the world using Wikidata, and then he has Reasonator, which creates a kind of robotic but somewhat functional article in every language in the world. So my question for the C-level staff is: why isn't that the highest-priority project within the Foundation? Because if our goal is free, open knowledge, then to quote Magnus, that's like a 250-times force multiplier. It dwarfs most other efforts the Foundation could make. Anyone?

I just got handed a mic because I'm the only C-level up here, and if there's any other C-level back there who'd like to grab the mic from me, by all means, jump in. Let me say that I don't think we've been really great as an organization at prioritizing, at analyzing the scope of the opportunities in front of us. And I don't mean this specifically about engineering; I think all parts of the organization have had this problem from time to time. We haven't necessarily been systematic about analyzing what the opportunities in front of us are and which are the highest priority. So without speaking to that specific thing, because I don't sit in the engineering managers' meeting and don't know whether it's been discussed, we are working across the organization (as all of you within the Foundation know) on a master projects list to help us understand what we're prioritizing and to make those trade-offs in a much more systematic way, as opposed to an ad hoc way. That doesn't answer your specific question, but I'm at least optimistic that we're getting better at looking at those kinds of prioritization questions. Is that a good start for an answer? Thanks.

Question from IRC; this is back to Leila. Matt Flaschen asks: how do you determine what content is important? Yeah, so for the purposes of the experiment, we did it differently from what we're planning to do in the future. For the experiment itself, we built a prediction model that predicts how much an article will be viewed in the destination language if it's created, and the ranking was based on that. Obviously, the number of page views is not necessarily a good indication of importance in the destination language, so we have since changed the prediction models for future work.
Right now, the way it works is that we look at articles that are available in both languages, as an example English and French, and we try to characterize them: what characteristics do the articles available in both languages have? By characterizing them, we build new prediction models that tell us whether a new article should exist in the destination language or not, based not on page views but on the other articles that coexist in the two languages.

Next question. Any more? Sorry. Yeah, sorry: how do you correlate article titles between languages? Presumably you don't prompt French users to write an article literally called "earthquake warning system." Do you use Wikidata? So, what we do, and this is mostly handled by content translation: for the purposes of the experiment, you would receive the article titles in English. English was the source language, and because the assumption was that you know English, this was not a problem. Once you clicked on a recommendation, you would be taken to content translation, and a pop-up would open that would, based on Wikidata, suggest the article title in French, or you had to type it yourself.

And one last question: will we be suggesting five possible articles to translate, or three, or two, or seven? Why five? Why did we choose five? Yeah, this was kind of a golden number we came up with. For the purposes of the experiment, we wanted to make sure we didn't give the user too few options. We tested a little among ourselves: if we received three recommendations and they weren't so good, how would we react? And because we didn't want to run the experiment over and over and exhaust everyone (this was a one-time thing), we decided to increase it to five. We figured that anything over five would be too much and would distract users' attention; this was done over email, and the email would look too long with more than five articles. So it was a hand-wavy, half-informed way of choosing.

Awesome, thank you. Thanks. Any more questions for any of our presenters today? Okay, well, I'd like to close by thanking all of our presenters, thanking Megan Nestler for keeping time, thanking Mr. James Forrester for moderating IRC, and thanking all the folks who work behind the scenes each and every month to help put on this meeting: our AV team, our comms team, and our front desk team. Speaking of which, we'll have our staff lunch next, and many thanks to Lamayli, Janet, and Athena for organizing that every month. And finally, thank you to all of you, yes you, for taking the time to join us today, and we'll see you next month.