All right, so today we've got a few lightning talks, and then we might have a panel with some questions afterwards.

So our first speaker today is user Alphama, or Tuan Thanh, from the Wikimedia Vietnam user group. He's its president, he's also a member of the ESEAP Regional Grants Committee, and he is a global renamer in the Wikimedia ecosystem. He's going to be talking about using bots to boost content development on the Wikimedia projects.

Thank you, thank you. Hello everyone, my name is Thanh and I'm from Wikimedia Vietnam. In the last session I saw a lot of presentations about education and about community development, but I would like to introduce how to use bots to boost content development in Wikimedia, and this applies across all our projects.

The first thing we need to know is what a bot actually is. Like the robots we invent to help us with tasks in real life, in Wikimedia a bot is a tool that helps us carry out repetitive and very boring tasks, like adding categories or fixing grammar. Before using bots, we need some community agreements: first, we need to set up a bot policy, and second, we need to set up a bot approval group, where people who know about technology and related things form a group and approve, or decline, any user who wants to run a new bot.

Next, we have a lot of bot tasks, and maybe some of you have noticed that in Wikipedia we have a lot of bots. I don't know about your language edition, but sometimes you can see bots working. First we have the anti-vandalism bots. Second, if you create a new account on Wikipedia, you will see a welcome message. There are other tasks like fixing or adding templates and categories, grammar correction, typo fixing, et cetera. More importantly, some bots are more intelligent, so they can create new content, new articles, in different languages, not only English. Another bot you will see: if you add bare links, bare URLs, this bot will automatically convert them to the correct citation format.

Let me go back to Wikidata a little bit, because in the previous session some of you mentioned Wikidata, and we also have another kind of bot to maintain the interlanguage links. I have been on Wikipedia since 2012, so I understand why we built Wikidata. When you write an article on the English Wikipedia, for example, you used to have to connect it to all the other language versions by adding links with the language code and the article name. In that case you end up with a net of links, which is very complicated. So we developed Wikidata as a central hub: we store the information in the middle and connect all the other languages to it.
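As a minimal sketch of how a bot sees that central hub, assuming Pywikibot (the article title and wiki database names below are arbitrary examples, not anything from the talk):

    import pywikibot

    # Look up an article on one wiki...
    site = pywikibot.Site("en", "wikipedia")
    page = pywikibot.Page(site, "Sydney")

    # ...then jump to its Wikidata item, the central hub that stores
    # the sitelinks for every language version of the article.
    item = pywikibot.ItemPage.fromPage(page)
    item.get()  # fetch the item's data from Wikidata

    # Read a few interlanguage links straight off the item.
    for dbname in ("viwiki", "frwiki", "dewiki"):
        sitelink = item.sitelinks.get(dbname)
        if sitelink:
            print(dbname, "->", sitelink.title)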
Next slide. Maybe some of you here will accept bots, or you will refuse bots; you don't want bots working in your project. We have heard many arguments. On the good side, bots can reduce human effort, so we can focus on developing other things in Wikimedia. On the left of the slide you can see that we have 59 million articles in over 318 languages; how can you patrol all of that content? We have massive content. That's why we need bots to help detect problems in articles and act on them.

So that's the good side of bots. I also know that in the region right now no Wikipedia except Vietnamese has over one million articles; we were the first edition in the region to pass one million. The trick is that we used bots to create new content, about species and plants and things like that.

And we also have the bad side of bots. First: do you think stub articles, about species, about animals or plants, are helpful? Yes, they are helpful, but for some people they are meaningless, because you just read a few sentences and there is nothing else. Second, bots break the naturalness of the language. Sometimes a bot adds a sentence that reads like a robot wrote it, maybe "Sydney is the capital of New South Wales", a very stiff sentence. And if we are lazy, we can become too dependent on bots.

Creating a bot is actually not so difficult. First you just need some experience with programming and regular expressions; you don't need much beyond the basics, and it is a plus if you know a programming language like Python or .NET. We have two famous tools to start with, AutoWikiBrowser and Pywikibot. On the next slide you can see the process: first you have an idea, then you discuss it with some people in your project, you write the specification and make the proposal; if it fails, you revise and try again until the bot is approved. Then you build the bot, you test it, and after that you run it. In the future, if you add a new function or want to fix the bot, you return and update it.

Let me give you some examples of why bots are important. We have ClueBot NG: this bot reverts probable vandalism. On the left you see that somebody vandalised Wikipedia and wrote something bad; the bot detects the content and reverts to the previous version. The next one is the CommonsDelinker: any picture or video on Wikimedia Commons with a copyright problem will be removed from Commons, and the bot then removes the links to it on the other wiki projects as well. This one is also very important: we use the archive bots to preserve links on the internet, and we actually get the archived link from the Wayback Machine. And this guy is very famous, right? How many of you know him? Just one? Anybody else? He has his own Wikipedia article, and he wrote a bot, Lsjbot, to create millions of species articles in Swedish and, because he married a Filipino lady, in her languages as well. If you look back at the list of Wikipedias, you can see those wikis in second and fourth place.

The next one is about myself and bots, because I'm currently a doctoral student in Mexico, and in the meantime I'm also doing an internship at the Singapore University of Technology and Design, so I created my own bot to serve my thesis. The first thing my bot does is general fixes and date conversion. Here on the Vietnamese Wikipedia I add some categories in the first line, I change text to Vietnamese, and I convert dates from English to Vietnamese format.
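As a rough sketch of that kind of general fix, assuming Pywikibot, with a placeholder article title and edit summary (this is an illustration, not the speaker's actual bot code):

    import re
    import pywikibot

    # English month name -> month number, for converting dates such as
    # "5 March 2024" into the Vietnamese form "ngày 5 tháng 3 năm 2024".
    MONTHS = {m: i for i, m in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"], start=1)}

    DATE_RE = re.compile(r"\b(\d{1,2}) (%s) (\d{4})\b" % "|".join(MONTHS))

    def vietnamise_dates(text):
        """Rewrite English dates in the Vietnamese date format."""
        return DATE_RE.sub(
            lambda m: "ngày %s tháng %d năm %s"
                      % (m.group(1), MONTHS[m.group(2)], m.group(3)),
            text)

    site = pywikibot.Site("vi", "wikipedia")
    page = pywikibot.Page(site, "Some article")   # placeholder title
    new_text = vietnamise_dates(page.text)
    if new_text != page.text:
        page.text = new_text
        page.save(summary="Bot: convert dates to Vietnamese format")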
The previous bot examples are quite simple and rule-based, but now, with the development of deep learning and neural networks, we have another generation of bots, and they are smarter. The first thing I want to introduce is content development bots. Some of you in a previous session also mentioned generating text from Wikidata, and that is exactly what I'm doing in my thesis. Say you want to create a new article in some language. First you need the knowledge base: on the left we have the Wikipedia infoboxes, or you can use the Wikidata version. Then you put that into a sequence-to-sequence or other deep learning network (I don't want to go into detail about this, just to let you know the method) and it automatically generates the text on the right. You can see that we have different methods, so I'll just show you some examples. Actually we have a project on this: it's called Abstract Wikipedia, as many of you here may know, or Wikifunctions. First we have the format, somewhat similar to a template, but here we have modifiers as well, and based on that format we can generate the text in different languages.
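As a toy illustration of the data-to-text idea: real systems learn this mapping with sequence-to-sequence models trained on infobox and article pairs, so the rule-based sketch below only shows the shape of the input and output, with made-up facts:

    # A deliberately simple stand-in for a trained data-to-text model:
    # take infobox-like facts and realise them as one sentence.
    def infobox_to_text(facts):
        """Turn a small infobox-like dict into one generated sentence."""
        return "%s is a %s in %s with a population of about %s." % (
            facts["name"], facts["instance of"],
            facts["country"], format(facts["population"], ","))

    facts = {"name": "Sydney", "instance of": "city",
             "country": "Australia", "population": 5_312_000}
    print(infobox_to_text(facts))
    # Sydney is a city in Australia with a population of about 5,312,000.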
So I think about the future of Wikimedia, and about bots. I don't know about your language, but I think English and German are the two wikis with the most developed bots right now; their bots are very intelligent and can do more functions. So I think about a model of human-bot collaboration, meaning humans and bots working together to improve the quality of content on Wikipedia. This is an example, and you can see it very clearly: the green lines are bot edits and the blue lines are human edits. At first the bot creates a very simple sentence, then a human comes and upgrades the version, and the bot automatically detects this and adds the categories, the external links, and the templates. For the references, the bot and the human can work at the same time, and that's how we get the article. Isn't it more convenient, instead of searching for the references and combining everything on your own, to have the bot help you from the beginning? I hope for this in ESEAP: I have joined the ESEAP conference twice before and I haven't seen anybody present about this. So I hope you will have your own bot projects, set up a bot policy, do something with bots in your project, and then we can use this model to improve content development in your project. That's my presentation, thank you.

We'll have some questions once we've gone through the slides. I just realised I didn't introduce myself before we started: I'm Alex Love, I'm the secretary on the committee of Wikimedia Australia. I have spoken to many of you, and if you do want to talk to me later, before you finish up, I'd love to meet you.

Our next speaker is Sam Wilson. Sam is a software engineer on the Community Tech team at the Wikimedia Foundation, although the opinions he gives are his own; he likes to make that clear. He's been on the Wikimedia Australia committee for quite a number of years, and he's been a resident, indispensable tech guru, so we're absolutely reliant on Sam to fix our website and all sorts of things like that. Sam is going to talk to you about indexing Australian content in Trove; many of you may have heard of this system from the National Library of Australia. Thank you, Sam.

So, hands up who's heard of Trove? The Australians. Trove is a project run by the National Library of Australia to index catalogues from the National Library and lots and lots of other Australian libraries and institutions. So it's a sort of meta-catalogue: library catalogues and archives and a few other things. One of the things I've been dreaming of for many years is to get Wikimedia content indexed in Trove, so that when you do a search in Trove you come up with records from your local library and from Wikimedia sites. That's what I'm going to be talking about.

Trove has major categories of content, listed here. All of them have some correspondence to Wikimedia projects, some more than others; some are harder to match. You'll see at the bottom there, people and organisations are no longer accepted; there does seem to be some question around including biographies and profiles of businesses and things like that. You can see books and libraries; that's the bread and butter of the library catalogue. Wikipedia doesn't host much content that we would index in Trove, but all of the other Wikimedia sister projects do. So we've got monographs and scores and manuscripts and records; photos, of course, on Commons; data sets on Commons too, a growing part of Commons that not many people are taking advantage of at the moment; audio and video and music, transcripts of oral histories and the audio itself; and maps, both digitised maps and map data. It would be brilliant if we could connect these things from Wikimedia.

The part of this process I want to focus on primarily is Wikisource. Wikisource hosts lots of text material, a lot of it published works, and that maps really well to an existing library catalogue. If you search for a published work in Trove and it's a book, you'll find the record of the book, and it will list all of the editions and all of the libraries in which you can find those editions. It would be wonderful if we could get Wikisource listed as a library alongside all the others, with a very prominent "click here": you can read the book, you can download the EPUB or the PDF, all of that. You can also find documents, especially photographs, on Commons.

The National Library has an existing system with Flickr: if you take a photo and put it up on Flickr, you can get it into Trove really easily by adding it to a Flickr group called Australia in Pictures, and then three times a day the Trove bot runs and the photo appears in search results. The image itself, the image file, remains on Flickr and doesn't get copied into any database on Trove; it is just the metadata. So we could do that with photos and individual files, audio and video as well, from Commons.

The notes in brackets at the end of these items indicate where we're storing our metadata. For Wikisource we store our metadata in Wikidata; for Commons files we now, increasingly, store our metadata in structured data on Commons. It's a bit of a distinction, because it does mean certain things about how we structure our image records. For instance, if you take a photo and you crop it into another file on Commons, we represent that as two files; we don't have a concept of a unifying photo record for those two things, we treat them as entirely separate things. Sometimes we might create a category to put them in, but more often than not, if there are only two or three derivative files for a photo, we won't create a category that identifies a given photo with its multiple derivatives. That's a bit of a challenge, because we wouldn't want to export multiple forms of the same work into Trove. One way around that, as I say, is categories, and so possibly the way to do this is to use categories a bit more.
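As a sketch of that category workaround, here is how a harvesting tool might gather all the file pages in one Commons category so it could emit a single record for the work; assuming Pywikibot, with a placeholder category name:

    import pywikibot

    # Treat one Commons category as a single "work": gather its file pages
    # so a harvester can emit one record instead of one per derivative.
    site = pywikibot.Site("commons", "commons")
    category = pywikibot.Category(site, "Category:Some photograph")  # placeholder

    files = [page.title() for page in category.members()
             if page.namespace() == 6]  # namespace 6 = File:
    print("One work,", len(files), "derivative files:")
    for title in files:
        print(" -", title)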
Wikipedia is where we host, obviously, biographies and articles about places; those would be very suitable, and sometimes articles about individual books and newspapers and all sorts of things that do exist on Trove. But this project isn't really focusing on getting those things into Trove at the moment. I think that will come, but it's more the bibliographic and photographic things I want to focus on.

The next part, once we've figured out a way of structuring and exposing our metadata and figured out the mechanics of that, is what do we actually want to export to Trove? What do we want to make searchable in Trove? Australian content is the answer, and that's quite difficult to define. We might say, well, all Australian books. But Australia as a concept (I think we were talking about this on the cruise last night) didn't exist until 1901, as an entity that exists in Wikidata anyway. So if we were to try to write queries that say "find me all books ever published in Australia", we would be leaving out a whole bunch of types of things, and we might also inadvertently be capturing a whole bunch of things that we wouldn't necessarily think of as Australian. The same goes for photography on Commons. I was looking at WikiShootMe just now: around the hotel there are a whole bunch of nice photos of individual motor vehicle models, well documented, well captured, but I'm not sure we would call that Australian content. They happen to have been taken in Australia, but they may not be the types of things we initially want to make available in Trove.

So the idea, at least to start with, is to make it an individually opt-in process: a Trove template on Commons and on Wikisource and on other projects will be added to the page, or the talk page, or the file page, or whatever is to be exported. And that's another tricky point, because on Wikisource we have lots of works that are actually subpages. We might have a newspaper that's been transcribed, and we want particular parts of it to be treated as works in their own right; or we might have a book of short stories; there are a whole bunch of cases in which a sub-part needs to be treated as its own work, so that the part gets exposed as a solitary work, but it would still link back to its parent, of course. So the Trove template will operate in a whole bunch of different situations, and as I was saying, if you put the template on multiple things that should be considered together, we need to make sure we're not exporting those as multiple records and effectively duplicating them in Trove.

The other part of this that's quite interesting is that we already have quite a few thousand, or tens of thousands, of items within Wikidata and Wikisource and Commons that already exist in Trove, and we don't want to be exporting those and making them appear to Trove as new works. We want to make sure that where we have taken something from an existing state library or the National Library or another institution, we record the Trove ID against that item, so that when we re-export our metadata, Trove knows that it already knows about it and can link us accordingly. That's another tricky step in this.

So the envisaged workflow is that you add the Trove template with no parameters. This gives a nice little human-readable box that says what's going on and adds the page to a category; this can happen on a talk page, or the main page, or in a Wikisource header template, or in a whole bunch of other ways. The Trove tool that I'm working on scans that category, takes all those items, and creates the required format for exposing them to Trove, which is an OAI-PMH feed (there are a couple of other ways we could do it; I won't go into the technical stuff).
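For a flavour of what goes over such a feed, here is a minimal sketch that builds the Dublin Core metadata block for one record, the payload inside an OAI-PMH ListRecords response; the title, creator, and identifier are illustrative, not from the talk:

    from xml.etree import ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
    ET.register_namespace("dc", DC)
    ET.register_namespace("oai_dc", OAI_DC)

    def dc_record(title, creator, identifier):
        """Build the oai_dc metadata block for one harvested item."""
        root = ET.Element("{%s}dc" % OAI_DC)
        for tag, value in (("title", title), ("creator", creator),
                           ("identifier", identifier)):
            ET.SubElement(root, "{%s}%s" % (DC, tag)).text = value
        return ET.tostring(root, encoding="unicode")

    print(dc_record(
        "The Magic Pudding",
        "Norman Lindsay",
        "https://en.wikisource.org/wiki/The_Magic_Pudding"))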
That will make a feed available to Trove; we tell Trove about it, and Trove periodically (a couple of times a day, a couple of times a week, or something) scans that feed and imports everything. In that importing process they assign a new Trove work ID to those items. Our tool then looks at their API, finds those work IDs, and adds them back into either the Wikidata item or the structured data on Commons. The displayed template changes at that point and says: hey look, we're also on Trove; click here to see where you can find this photo in a state library, or, for a book, a local library near you, or other things like that. So that's the Wikimedia user-facing process for how that'll work.

The Toolforge tool will be running periodically, and it'll have a web interface that shows what's going on. It will also highlight problems (I think we're getting blown off the top of the building now): if someone nominates a work for exporting into Trove and it doesn't have all the required metadata, or it says things that are inconsistent, or we detect that it can't be exported, then we can highlight that and let the user know; we can send them an Echo notification on their wiki, or there are a few other ways that can work. I've been looking at the existing works from Trove in our projects, and there are a whole bunch of inconsistencies in the data. That needs to be cleaned up, and we need to make sure we guard against creating any further inconsistencies and problems.

Trove also provides a whole bunch of other metadata: the catalogues it aggregates have their own IDs, and we can bring those in as well, once we've made the connection between our identifier and Trove's identifier, and then we can farm out and get a whole bunch more. That's a really interesting process. Often you might look up an image on Trove and find that all it is is a link to another library, and that other library might have a completely different system for actually viewing the image; a common one is a sort of zooming viewer where you can zoom in and do stuff, and there's a lot more functionality that's not exposed through the obvious APIs.

So, where we are with this: it's at very initial steps. I'm looking for anyone who's interested to help figure out the problems here and make sure it's not doing the wrong things. Then we're going to work through it: I'm going to make sure all of the existing IDs are sorted out, are good-quality data, and are not going to be duplicated when we re-export. Then there's a money aspect to this: Wikimedia Australia needs to sign up as a Trove partner in order for them to harvest our metadata. That looks like it's somewhere in the order of $2,000 a year, but of course there may be possibilities around negotiating that; I'm not really sure. If we want to appear as a library within Trove, we need to get a NUC symbol allocated to us, and I'm not sure about the process for that, but I'm sure there are librarians here who know a lot more about it than me. But in the initial stages we can add Trove IDs to existing items and work through all of that, and then we can start adding the Trove template to new things, and things will be exported and everything will be interlinked. That's all I've got to say; if anyone wants to talk about this, there's a page on Meta, and I'm around this afternoon.
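A sketch of the ID write-back step, assuming Trove's version 2 API as I understand it; the endpoint and parameter names should be checked against the current Trove API documentation, and the response parsing is a best guess at the v2 JSON layout:

    import requests

    API = "https://api.trove.nla.gov.au/v2/result"  # assumed v2 endpoint

    def find_trove_work_ids(query, api_key):
        """Search Trove's book zone and return the work IDs found."""
        response = requests.get(API, params={
            "key": api_key, "zone": "book", "q": query, "encoding": "json"})
        response.raise_for_status()
        zones = response.json()["response"]["zone"]
        return [work["id"]
                for zone in zones
                for work in zone.get("records", {}).get("work", [])]

    # Usage (needs a registered API key):
    # ids = find_trove_work_ids('"The Magic Pudding"', "YOUR-API-KEY")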
Cool, thank you. Thank you, Sam.

Okay, our next speaker is Dr Kerry Raymond. She's a retired computer science professor and a researcher with a passion for local and family history; she combines her skills and interests in finding ways to use IT to improve how we research history. She's a very active contributor to Wikipedia on Queensland history and geography topics: any one of the thousands of articles about the Australian state of Queensland has likely been edited, corrected, or, as I was saying, created by Kerry. She's going to talk to us about Web2Cit. Okay, thanks Kerry, over to you.

Thank you, Alex. Okay, so I'm here today to talk to you about Web2Cit, or, as I call it, how to train your dragon. I'm Kerry Raymond; I am User:Kerry Raymond on Wikipedia, and at Wikimedia Australia, very simply, I'm just Kerry Raymond.

Okay, why are we interested in doing this? Wikipedia likes online citations. Why do we like online citations? Because they make it easy for our readers and contributors to verify the content, and they also make it possible for the reader to explore the content in greater depth. So there are lots of benefits to having online citations on Wikipedia, and if we want more of them, let's make it as easy as possible to create them.

Now, who here is familiar with Citoid, the automatic citation tool in the VisualEditor? Not as many as I thought; you should be getting into it. Citoid is the first step. If you're using it at the moment, you will know that when you give it a URL, sometimes it gives you a nice citation, sometimes it doesn't, and sometimes it gives you a citation that's sort of okay but you'd really like it to do better. Web2Cit is the next step, where we go from what is essentially a very fixed way of turning a URL into a citation to one where you can provide a bit of input to get a better outcome: you can train it.

So if we look at something like this, a Queensland Government website about the parks and forests of Queensland (we have hundreds of national parks), it's a reliable source and often cited in Wikipedia. Now, what is the problem? If you go with Citoid, you get an awful lot of mumbo-jumbo appearing as the first name and the last name. These web pages do not have individual authors; it's picking up this random rubbish, and we'd like to get rid of it. What we'd like the citation to look like is something more like this: with the title, with the website name Parks and forests, and the fact that it comes from the Queensland Government, which speaks to the authority of the source being cited. This is a great source to cite, but at the moment Citoid isn't doing a very good job. If only I could train it to get it right.

So how do we go about it? If you're using Web2Cit, when you give it a URL it will produce the Citoid citation and the Web2Cit citation, if it has one, and you can choose which one to insert. Of course, like any of these, you can further edit it if you wish, but the game is to get it so good that you won't need to. Just a bit of a clue about how it's written: if any of you are familiar with Zotero, it uses functions from Zotero. And if you want to get to the training module, it's through that link at the bottom.
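For reference, Citoid itself can be called directly through the Wikipedia REST API; this minimal sketch fetches the raw citation Citoid would produce for a URL (the URL below is a placeholder, not the exact page from the talk):

    import requests
    from urllib.parse import quote

    # Citoid is exposed through the Wikipedia REST API; "mediawiki" asks
    # for the format that the visual editor consumes.
    url = "https://www.qld.gov.au/camping"  # placeholder page
    endpoint = ("https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/"
                + quote(url, safe=""))

    citation = requests.get(endpoint).json()[0]
    print(citation.get("title"))
    print(citation.get("author"))  # often the "random rubbish" Kerry describes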
So what's going on in Web2Cit? The training tool is a set of forms you fill in. You start by providing an example URL for a web page from the site you're interested in, and for each of the citation fields you say whether that field is required; if it's required, where you get the information from; and how you transform that information into the correct format. Sometimes you need to do a lot of steps, sometimes you don't. So I start by saying I don't need the last name (that gets rid of one piece of rubbish) and I don't need the first name (that gets rid of the other piece). I say yes, a title is required, but Citoid got the title right, so I don't have to do any work; I just say: use Citoid's title. Website? I'm going to require that, but because we know exactly which website we're dealing with, we can just put a fixed value in there: Parks and forests. The publisher? It's always going to be the Queensland Government, a fixed value. So that's how you train it, and that's how we get the result you saw before, the nice one that we wanted.

One of the other problems you will encounter when you're using Citoid is that sometimes it won't find the author. Here again is another Queensland Government website (Queensland Government websites don't work very well with Citoid), and we're missing the author. In this case there is an author: a Government Minister called Mark Ryan. So how are we going to train Web2Cit to get it right? This is where you have to have a peek inside the HTML and find where that name appears. You don't really have to understand the HTML; what you have to notice is which paragraph or other HTML structure it appears in. Once you've got that, you're in business.

So we say to Web2Cit: I'm going to teach you how to get the author's last name; I want you to use XPath to get the text from that element I just copied and pasted out of the HTML. I don't understand it, I just copied and pasted it. It will give us back this: "Minister for Police, blah-de-blah-de-blah, The Honourable Mark Ryan". I then say: split it around "The Honourable". This gives me two pieces: piece one, the bit before ("Minister for Police" et cetera), and piece two, "Mark Ryan". I say I just want piece two, "Mark Ryan". Then I say: see that space? Split around the space. This gives me "Mark" and this gives me "Ryan". If I select item two, I now have my last name; item one, I now have my first name. So it's a fairly simple process to transform the information you find in the HTML into the fields we need for a web citation on Wikipedia. In this particular case, you will see that suddenly Mark Ryan appears as the author of this work.

The other thing it wasn't picking up before was the date, where I did the same trick: I found the piece of HTML that held the date, and away it went. Incidentally, Web2Cit is very clever: if the page says "yesterday" or "today", as that particular government website does, it converts it into the actual date for you, which is very handy. I made a few other little adjustments, but they were just fixed values, so nothing exciting.
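In plain Python, the chain of transformations Kerry walks through looks like this; Web2Cit applies these steps declaratively through its forms, so this code is only a re-creation of the logic, with the minister's title abbreviated:

    # Step 0: the text the XPath selection hands back.
    text = ("Minister for Police and Corrective Services "
            "The Honourable Mark Ryan")

    # Step 1: split around "The Honourable" and keep piece two.
    name = text.split("The Honourable")[1].strip()   # "Mark Ryan"

    # Step 2: split around the space; item one is the first name,
    # item two is the last name.
    first_name, last_name = name.split(" ")
    print(first_name)  # Mark
    print(last_name)   # Ryan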
So where is Web2Cit? Well, the training tool lives on Toolforge, but to be honest you don't have to go there much, because the training rules, which are created by the training tool, are stored on Meta. Now, it's quite unusual to store something like this on Meta as opposed to one of the Wikipedias. The reason Meta was chosen is that it makes the new citation format available to every language Wikipedia; the negative is that we cannot include any links in the citation fields, as they will not necessarily be present in all Wikipedias, particularly other-language ones. You can find the project documentation on Meta at that address. Most importantly, if you want to use Web2Cit, you must follow the instructions to install it; this means putting a couple of lines of text in your common.js. If you know about common.js, you're fine; if you don't, well, someone else can do it for you. And remember, only one person has to set the rules up, has to train for the website; everybody else can then benefit from it. That's the great thing.

So what's the status? The project, which was funded by the Wikimedia Foundation, has been completed, and the prototype works: all the examples you saw are real. I would like to congratulate the project leads, Diego and Evelyn, from Brazil and Uruguay respectively. I have started training my dragons; I've done several of these now. Most of them weren't too successful, but you know, I got better with practice. Next steps, which we might do perhaps through the ESEAP structure: perhaps run some training sessions for those of you who would like to learn how to train citations with Web2Cit; perhaps form a dragon team for mutual assistance in doing that; and we can also encourage other contributors to use Web2Cit even if they don't do the training or find that too hard; and of course provide feedback for future development of Web2Cit. Thank you.

Thank you so much, Kerry. I can hear some common.js being furiously edited as we speak, particularly from New Zealand. Do we have any questions for any of our speakers? Yes? This one is for Sam: [inaudible]. Okay, thank you.

You probably know what's coming, Kerry: can you use this for reports or other types of publication? As long as it works from a URL, so as long as you're talking about an online journal article or online report, you can instruct it that the citation type is journal or document or whatever; all of the normal types are available.

Any other questions, comments, feedback? Another one from Amanda: a question for Sam. Have you consulted the National Library about your project? Because everybody seems to want to do something with the National Library's API, and they seem a bit resistant to the pressure. No, and that's a very good point, but I didn't want to be premature and do something that wasn't going to be possible. I want to get our side sorted; getting our side sorted is useful anyway, even if they don't want us.

Kerry, I've got a question for you: is this like replacing Citoid? It seems like it's duplicating it in a way; shouldn't Citoid be able to do this? I think this is a question of where it goes in stage two, a little bit. If enough people like it and think it's a great idea, I imagine it could replace Citoid in the long term. At the moment you get the Citoid one and the Web2Cit one, so that's an intermediate sort of position, but in the future it could become, and replace, Citoid. Thanks, Kerry.

All right, final call for questions. Oh, Toby? No, you're putting your hand up, go on, we've got plenty of time. A quick comment for Sam: Mike Dickerson and one of the state librarians here have been talking about a slightly similar thing, to try to get Wikisource books onto WorldCat and OCLC, so we'd love to talk to you or include you in those conversations. Thank you, Toby. Anyone else? All right, thank you everyone for coming to the session. I hope you learned a lot about tools and bots and APIs and whatever it is Sam is
doing, with lots of curly brackets and templates and so on.