Hello, I'm Alex, also known as Riven, and I'm presenting joint work with Sam Wilson and Natalia Rodriguez of the Community Tech team at the Foundation about OCR and related tools for Wikisource. I'll introduce their presentation, and they will carry on: first Sam, talking about his work on the OCR tool, and then Natalia, who will discuss the Community Wishlist Survey and the other tools currently under development at Community Tech.

So what's OCR? OCR is software that converts images of text into machine-encoded text. It's the work that we perform every day on Wikisource, basically: turning the text in the image of a page into text that can be cut and pasted and used elsewhere. You may have seen this new tool recently, in the top right corner of pages in the Page namespace on Wikisource. Sam will do a brief demonstration of this tool, and then Natalia will carry on, commenting on how the wishes that we compile every year in the survey are actually selected by the tech team, and which other tools are currently under active development: the OCR tool of course, and also the e-book export project and the import tool for files on the Internet Archive. So without further ado, here's Sam Wilson.

Hi, everyone. I'm going to demonstrate the work that Community Tech has been doing lately on OCR for Wikisource. The state of things has been that there are two gadgets: one to run Tesseract OCR and one to run Google OCR. These have both been in the main editing toolbar. You can see here the two OCR buttons, one in black for Tesseract and one in the Google colors for Google. Clicking these runs OCR on the image that's visible at the right-hand side and extracts the text. It is just the image that is visible here, not a higher-resolution version of it. The Google button does a similar thing, and the two OCR engines produce slightly different results; each has different strengths and weaknesses in punctuation, layout, columns, and that sort of thing.
So the 2020 wishlist proposal was to unify these two gadgets, putting them together in one tool and one Wikisource extension that is then available on all Wikisources. These gadgets are on, I think, about 20 or 25 Wikisources, but not all of them, and some Wikisources don't have technical users who are able to set up gadgets. So it's really good to have the code in one central place: it can be translated into all languages and deployed more easily to all Wikisources.

I'll switch now to a demonstration wiki and show you the new OCR system. If I go to a random page here, you'll see at the top right side of the editing toolbar there's a new button called "Transcribe text". You can see there are no other OCR gadgets, although they can be left turned on for the time being if need be. When you click "Transcribe text" for the first time, you get this onboarding tooltip, which explains what this new button is. When you hit "OK, got it", the tooltip disappears, the blue dot disappears, and it won't come back again; that's gone forever.

If you then click "Transcribe text", you get an indication that something's happening, and the text is inserted into the text area. At that point you can click undo to return to the previous content of the text area, whatever that was. So if you don't like the look of the OCR text, that's an easy way to go back a step. Next to the "Transcribe text" button is a drop-down where you can select between Tesseract and Google. This is the same as the two gadget buttons, except here it's spelled out a bit more thoroughly. If we switch to Google and hit "Transcribe text" again, the same process happens: we end up with the undo button again and the different OCR text here. If you don't want to undo, this bar will disappear in 30 seconds, or you can hit the close button and it will disappear immediately.
The same applies if, as is often the case, you have the header and footer panels open: if we hit "Transcribe text", the bar appears at the top of the body text rather than at the top of this whole area.

The other item in the drop-down list is the advanced options. If we click on that, we go to the Toolforge form for the tool, which preloads the same image URL as the image we see on this page. There we can again select between OCR engines, and we can also provide a list of languages. A list of languages means we can provide multiple languages: if a page contains more than one language, giving the OCR engine information about what those languages are means it has a much better chance of producing quality OCR text. So in this case we'd start typing "fr" for French, for instance, if there were some French text on this page, and we would hit transcribe. In this case it probably doesn't make a great deal of difference, as it's all English text. If you run this process and you want to keep the text, it takes an extra step: copy it to the clipboard, go back to the editing page, select all in the text box there, delete that, and paste in the OCR text from the tool.

The other part of the advanced form is that if you switch to Tesseract you get a section of Tesseract options. This gives a bunch of different page segmentation modes, which are useful for different sorts of things: if you're transcribing a table of figures or multi-column material, or if you don't want it to do any script detection at all. OSD here stands for orientation and script detection; that's Tesseract's attempt to find areas of text in the image and determine what language they're in. If you leave the language blank, it will attempt to determine what it is.
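To make the language and segmentation options concrete: the Tesseract engine itself takes a language argument (multiple language codes joined with `+`) and a `--psm` page segmentation mode, which is what the advanced form is exposing. Here is a rough illustrative sketch of how those two options fit together; the helper function name is hypothetical, not part of the actual tool:

```python
def tesseract_args(languages, psm=3):
    """Build Tesseract's language and page-segmentation arguments.

    languages: list of language codes, e.g. ["eng", "fra"] for a page
               mixing English and French.
    psm:       page segmentation mode, e.g. 3 = fully automatic page
               segmentation without OSD (the default), 6 = assume a
               single uniform block of text.
    """
    # Joining with "+" is how Tesseract accepts multiple languages at once.
    lang = "+".join(languages) if languages else None  # None: let Tesseract guess
    return lang, f"--psm {psm}"

# With the pytesseract wrapper, this would be used roughly like:
#   import pytesseract
#   lang, config = tesseract_args(["eng", "fra"], psm=6)
#   text = pytesseract.image_to_string("page.png", lang=lang, config=config)
```

As the talk notes, leaving the language list empty (here, `None`) lets the engine attempt detection on its own, which works for some scripts but is less optimal.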
In this case it's reasonably good, and if a particular language isn't supported by Tesseract or Google, then it's best to leave it blank: the engine will have a go at it, and for some scripts that works okay, though it's obviously less optimal. This list of languages will change when you switch between Google and Tesseract; there are some small differences. For instance, Google has Italian only, whereas if we switch to Tesseract we also have Old Italian, which is for an older form of the Italian language.

So that's about all of it; basically, this is the big new button. We do still have a bunch of open tasks that we're working on. Hopefully some of these will be finished before the end of Wikimania, and others hopefully later this year. Community Tech will be wrapping up this work around the time of the Wikimania hackathon, so if there's anything you want to tell us about it, or any bugs you find, please do so. You can contact me at any time; my username is Sam Wilson, and I'll be really happy to figure anything out. Yeah, that's about all from me.

Cool, thank you. I want to follow up on Sam's demo to tell you about who we are, how we ended up working on Wikisource projects, and how we can hopefully end up working on more Wikisource projects, with your participation, in the next wishlist cycle. So, a bit about us: we're a team of product managers, designers, and engineers that is five wishlists old. We run a wishlist every single year, and we're going on to our sixth in 2022. We have worked on many projects for the Wikisource ecosystem that were the results of wishes that came our way via the wishlist. You just saw Sam demo the OCR improvements, but we also worked on the e-book export project, which you can read about on our project page, and we also support the IA Upload tool. I wanted to tell you all about the wishlist and get it in front of your eyes, so that I can encourage more of you to come participate in the next iteration of it in 2022.
The wishes for OCR, e-book export, and IA Upload were proposed to us in 2020. That year we ran a special version of the wishlist that was an open floor: any contributor could pitch product features or problems for any project that wasn't Wikipedia, Commons, or Wikidata. The reason we did this is that we wanted to give more product love and more resources to the projects that aren't usually at center stage. And Wikisource really showed up that year: we had a total of 72 proposals, 423 contributors, and over 1,700 votes, and the top four wishes that year were all Wikisource wishes. We were really happy to see this, and we really hope the Wikisource community will keep coming to the wishlist and keep proposing and voting on things. We know there's lots of love to give to the Wikisource ecosystem, and we'd love to keep providing that support.

So I wanted to walk you all through a sample wish proposal, and then really quickly walk you through our prioritization process. In a sample proposal, what you'll have to fill out as a proposer is the problem statement: really describing the problem you're facing, or the product feature you want worked on, and also who would benefit from it. These are the two most important fields. The next field is the proposed solution, but that's optional; you can fill it out if you have ideas, but we really just want to understand the problem and who would benefit. Then there's a discussion period, when other contributors can help you shape the proposal, and after that voting begins and people can vote on the proposals. Now, once we are presented with a list of ranked wishes, as I just showed, popularity is one of the main things to think about, but we also consider other things, such as how many users will be impacted by the wish.
I wanted to mention that popularity is really important, but it's not the most important factor. There are other things we take into account, such as impact, which I just mentioned, but we also think about how complex the wish will be from a technical perspective. The engineers look at proposals and think about external dependencies, things that will add complexity to the code we're writing; maybe the things we're tackling have some legal or security dependencies, or maybe a wish needs some more elaborate database updates. So the engineers give each wish a technical complexity score. Then the designer and the product manager look at how many flows inside the experience the wish touches: how many pages will we need to update? Is it a simple button that we're introducing, or multiple nuanced experience flows that we need to account for? Then there is, as I mentioned, the popularity; this one's really important, but it's not the main factor. And we also consider whether or not the wish will help historically excluded communities. The sum of it all together gives us a prioritization score.

One thing to note is that we score all wishes only after we talk to the other product departments of the Foundation and make sure that nobody else is planning to work on something related to what's described in the proposals. Sometimes we get a lot of proposals for the editing flows, for instance, but maybe the Editing team is already planning related work, so to avoid duplicated work we make sure that that's not the case.
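The scoring described above (engineering complexity, design scope, popularity, and equity, summed into one number) can be sketched as follows. The field names, scales, and weights here are hypothetical; the talk only says that the components are scored by different roles and added together:

```python
from dataclasses import dataclass

@dataclass
class Wish:
    title: str
    technical_complexity: int       # engineers' score, 1 (simple) to 5 (very complex)
    design_scope: int               # flows/pages touched, 1 (one button) to 5 (many flows)
    votes: int                      # support votes from the survey
    helps_excluded_community: bool  # equity consideration

def prioritization_score(wish, max_votes):
    """Hypothetical sum: simpler wishes, popular wishes, and equity score higher."""
    complexity_points = 5 - wish.technical_complexity  # invert: simpler is better
    scope_points = 5 - wish.design_scope               # invert: fewer flows is better
    popularity_points = round(5 * wish.votes / max_votes)
    equity_points = 2 if wish.helps_excluded_community else 0
    return complexity_points + scope_points + popularity_points + equity_points
```

Under a scheme like this, a simple, popular wish that helps a historically excluded community would outrank an equally popular but technically complex one, which matches the trade-offs described in the talk.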
Once we've combed through every wish that comes our way in the wishlist, we have the wishes prioritized by this prioritization score, and we can tackle things more strategically: we can take on the most wishes when we know how complex each will be, and we stagger them based on complexity so that the engineers can move forward with the code while the designers research the next wish, and so on. I wanted to highlight that part of the process because the better scoped and the better described a wish is, the more likely it is to score very high on our list, along with, of course, its popularity: how many people vote for that wish. We're really excited to keep iterating on our processes, to keep learning from y'all, and hopefully to keep working on the Wikisource wishes that come our way. See you at the 2022 wishlist, which will be running in January of 2022.
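The staggering mentioned above (ordering work so engineers build one wish while designers research the next) amounts to a simple ordering: highest score first, and among equal scores, the less complex wish first so work can start sooner. This is an illustrative sketch; the tuple layout and tie-breaking rule are assumptions, not the team's documented process:

```python
def plan_order(wishes):
    """Order (title, score, complexity) tuples: best score first,
    breaking ties in favor of the simpler wish."""
    # Negate the score so that sorted()'s ascending order puts high scores first.
    return sorted(wishes, key=lambda w: (-w[1], w[2]))
```

For example, two wishes with the same score would be scheduled simpler-first, letting design research on the harder one overlap with engineering on the easier one.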