The title — I love a good pun and a movie reference, so when I came up with this one I didn't really add a subtitle that actually explained what it was about. So if you did click through, look at the abstract, work out what it was about and still came along: thank you very much, it's much appreciated. What I'm looking at is fixing geographic duplicates on Wikidata, the Cebuano Wikipedia and GeoNames. As I said, I'm Alex from Wikimedia Australia — user Canley on Wikimedia projects, and MetaCortex on Twitter (sorry, X), Mastodon and GitHub. The GitHub account is important, because a lot of this talk — the code and queries and so on — I'll be putting on GitHub. I basically want everywhere in the world to have comprehensive geographic data on Wikidata. It's just so useful for all sorts of other projects and analysis — using Wikidata to its maximum utility, I suppose. So here's the GitHub repository I mentioned: MetaCortex — spelt in a rather strange way — slash geo-dup. I'll show it again at the end if you want. As I said, it's a little bit technical: I'll be showing lots of queries, techniques, use of the API, but I'll be putting everything on there. So if you're interested — particularly if you want to apply these techniques to your own country, or to any data in Wikipedia and Wikidata — you can get it there, and feel free to fork, adapt and contribute anything you want.

So what is the problem? What we're looking at is a particular case for Aotearoa New Zealand.
At Wikimedia Australia we work very closely with our friends, colleagues and cousins across the ditch, as we say. I'm looking at New Zealand because they have a very good, definitive data set from Land Information New Zealand, and that's what I'm using to detect where duplicates exist in Wikidata. So the problem is that from 2014 to 2019, Lsjbot — which you might have heard of; there's a Wikipedia article about it at the bottom there — generated 9.5 million articles on the Cebuano and Swedish Wikipedias. I believe the Swedish Wikipedia deleted a lot of these articles, but a lot of the ones on Cebuano remain. In the years since, other bots then went and created Wikidata items for the articles on the Cebuano Wikipedia. What that means is that there are many, many thousands of duplicates — mostly for geographic places, but for all sorts of topics that were generated. Lsjbot used various data sources to generate these articles; for geographic places and objects it used a site called GeoNames, which — and this might sound familiar — is an openly licensed, user-contributed global gazetteer. It's all released under a Creative Commons licence, it's free, and it covers the whole world. It's an amazing resource, and I just want to say this is not a criticism of Lsjbot, of GeoNames, of the Cebuano Wikipedia, of anything. It's more a cautionary tale about the perils of doing large imports or bot generation without perhaps taking a bit more care — and it's a little ironic, because
I will be covering automated ways to try to clean up some of this data. So keep that in mind if you want to adapt or use this — and I'll try to keep it in mind too — just be careful when you're doing any kind of large-scale or rapid work like this. So, GeoNames: the website is geonames.org, and as I said, it's a very comprehensive global gazetteer with coordinates and names for all sorts of places around the world, in every country. Unfortunately there are some issues with accuracy — the coordinates are often highly inaccurate. This is quite strange: the data sources they list say they have used Land Information New Zealand gazetteer data and Statistics New Zealand information, so I'm not sure why the coordinates are almost universally wrong. I had a few theories about it. I think there was an academic paper with a theory about bounding boxes. Someone suggested to me before that they've actually used digitised maps and taken the location of the name label, which is why they're all a bit off — the name label was never actually on the point where the mountain or river is. And a lot of the rivers have multiple coordinates, which could be because a river will have its name printed at numerous locations on a map. So I'm not sure if that's the reason, but there are some very strange issues with duplicates and with coordinates being quite significantly wrong. And I mentioned the duplicates: objects have been imported to GeoNames multiple times. If there are issues in other countries they could be different or they could be the same, but it looks like — and I think this is to do with the way Land Information New Zealand categorises everything — it categorises all mountains and hills as hills, and it categorises all rivers as streams.
So I think someone has imported all the mountains as hills and also imported them as mountains, and that's why there are a lot of duplicates. I've done a chart of the raw GeoNames data set — this isn't what's on Wikidata; this is straight from GeoNames, where you can download an extract of each country's full data set. They've got these feature codes here, and as you can see, stream (STM) is far and away the most common one with the most duplicates, then hill (HLL) and mountain (MT), and I think peak (PK) as well. There are duplicates across all sorts of categories, but to deal with most of them efficiently, if we look at the streams and rivers and the hills and mountains, that will fix most of the issue in the quickest, most efficient way. So, as I said, the duplicates problem seems to occur mainly for certain feature types: mountains and hills, and rivers and streams. Let's look at some examples. I picked some fairly random names and then looked them up in the New Zealand Gazetteer — the official, authoritative data set — and also in GeoNames and Wikidata. I looked at Blue Mountain: five results.
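The chart of duplicates per feature code can be reproduced from the per-country extract mentioned above. A minimal sketch, assuming the standard GeoNames dump layout (tab-separated, name in column 1, feature code such as STM/HLL/MT/PK in column 7) — the file name is just an example:

```python
"""Count duplicate names per feature code in a GeoNames per-country
extract (e.g. NZ.txt from the GeoNames download area)."""
import csv
from collections import Counter

def duplicate_counts(path):
    """Return, per feature code, how many surplus (duplicate) entries
    share a name with another entry of the same code."""
    pairs = Counter()
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            pairs[(row[7], row[1])] += 1   # (feature code, name)
    dupes = Counter()
    for (code, _name), n in pairs.items():
        if n > 1:
            dupes[code] += n - 1           # surplus entries only
    return dupes

# e.g. duplicate_counts("NZ.txt").most_common(5)
```

Sorting the result by count gives the same picture as the chart in the talk, with streams and hills/mountains dominating.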
You can see where they are on the map there. This is GeoNames, and that's how the GeoNames search results show up. I've filtered by mountain/hill/rock, because quite often there'll be a town or locality with the same name as another geographic feature, so I've filtered by that type. It looks like a lot more, but it's also picking up Blue Hill and Blue Mountain Pass and so on in the search results. If you look at which hills or mountains are actually named Blue Mountain in the GeoNames data set, there are seven results — so already we can see there are two more results than we should have. Here's a plot on the map. It's probably a little hard to see, but the three at the top on the South Island are a bit lighter, and the two below are a bit darker. There's a bit of transparency on those dots, so the darker ones mean there are two points there — two items called Blue Mountain very close to each other, pretty much on top of each other. That corresponds to comparing the five results to the seven results: there are two duplicates here. The other example is the Wairoa River, filtered by stream type. There are a couple on the North Island and two on the South Island — seven results in the Gazetteer, but 23 results in GeoNames. It looks like a lot more because it's picking up anything related, but you can already see that rivers and streams, which are usually represented by a line — same with mountain ranges — will often have quite a long chain of duplicates. And here's the plot of those as blue dots. You can see it generally matches the Gazetteer one, but there are quite a few more dots and lines. So how can we pick these up? There are a couple of ways — you could do a geospatial query, but what I've actually done here is use clustering, which I'll come back to. First: how can this be fixed?
Let me take you back to how to fix the issues on Wikidata. First of all, there's a Mix'n'match catalogue for the New Zealand Gazetteer that I set up a couple of years ago. It's going quite well — it's got about 60 per cent matched — but some of the difficulty in doing these matches and reconciliation is all these duplicates that exist. So the first step is to reconcile items to their LINZ (Land Information New Zealand) IDs. The second is to identify the duplicated items — as you saw for those two examples, there are quite a few. Then remove the Cebuano sitelinks and merge the duplicate items in Wikidata, and then replace the coordinates: as I said, even if there's only one item, the coordinates are usually quite inaccurate, so we can replace them with the more accurate LINZ Gazetteer ones. Those last two steps can be done with the QuickStatements tool on Toolforge. People have been fixing these manually — myself, user Prosperity and user ShakyIsles have been going through and manually merging Wikidata items. It takes a few minutes for each one, which is why I've been working on more automated techniques. As for which ones have already been manually merged: the way I've checked is by querying where an item has multiple GeoNames IDs. When items that each had a single GeoNames ID are merged into one Wikidata item, that item ends up with several IDs. That gives me a rough idea of how many: as you can see, only just over a third — 37 per cent — have been merged, and from what I identified with various queries there's still about 63 per cent to go.
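A sketch of the kind of SPARQL query just described — counting already-merged items by looking for multiple GeoNames IDs (P1566 is the GeoNames ID property, P17 country, Q664 New Zealand; the exact query used in the talk may differ):

```sparql
# Items in New Zealand carrying more than one GeoNames ID,
# i.e. duplicates that have already been merged into one item.
SELECT ?item (COUNT(?gnid) AS ?ids) WHERE {
  ?item wdt:P17 wd:Q664 ;
        wdt:P1566 ?gnid .
}
GROUP BY ?item
HAVING (COUNT(?gnid) > 1)
ORDER BY DESC(?ids)
```

Comparing that count against the total number of GeoNames-sourced items gives the rough "37 per cent merged" figure.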
To identify which ones are duplicates, I've used clustering, which is usually counted as a machine learning technique or algorithm, but it's basically just a geostatistical algorithm. As I said, there are two different kinds of features: points, like mountains, where the peak or summit is a single point; and rivers and mountain ranges, which are represented by lines. I found that two different algorithms each work better for one or the other. For mountains and other point features, k-means clustering works really well. It detects where an item doesn't have a duplicate and just puts those in cluster zero; but for the other two — the ones which showed up a little darker on that map before, the ones with duplicates — it's identified them as clusters one and two, down in the middle of the South Island. For rivers — the Wairoa River example — I found that the DBSCAN algorithm works better for linear features like rivers, streams and mountain ranges. You can see it's picked up the seven duplicates there, and it's worked quite accurately. I did run the k-means algorithm on this as well, and it wasn't quite as accurate. So that's what I would recommend: DBSCAN for rivers, and k-means for mountains. Okay, so we've identified our duplicates — what do we do now? We want to merge them in Wikidata. This can be done in QuickStatements, as I mentioned. You need to remove the Cebuano Wikipedia sitelink first, because you can't just merge two items that both have a sitelink to the same wiki.
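To make the DBSCAN idea concrete, here's a minimal, self-contained sketch — not the talk's actual code — that clusters same-named features by great-circle distance. The coordinates are invented; in practice you'd feed in every feature sharing one name (e.g. every stream called Wairoa River), and any points landing in the same cluster are duplicate candidates:

```python
"""Tiny DBSCAN over (lat, lon) points; label -1 means 'no duplicate'."""
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def dbscan(points, eps_km=2.0, min_pts=2):
    """min_pts=2: any two same-named points within eps_km form a
    duplicate cluster; isolated points are labelled -1 (noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        near = [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]
        if len(near) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        seeds = list(near)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:          # noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_near = [k for k in range(len(points))
                      if haversine_km(points[j], points[k]) <= eps_km]
            if len(j_near) >= min_pts:
                seeds.extend(j_near)     # expand the cluster
    return labels

# Two near-identical "streams" ~0.6 km apart, plus one far away:
pts = [(-45.10, 168.50), (-45.104, 168.505), (-41.20, 174.80)]
print(dbscan(pts))  # first two share a cluster; the third is noise (-1)
```

The `eps_km` radius is the tuning knob: linear features like rivers tolerate a larger radius than point features, which is consistent with DBSCAN outperforming k-means on rivers here.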
They'll clash. You can remove the sitelink by replacing it with an empty string in QuickStatements — the top line there says: for this item, which I've identified as a duplicate, set the cebwiki sitelink to an empty string, and that removes the sitelink. Then you can use the MERGE command in QuickStatements to merge and redirect to the target item. You'll probably still have to do some other cleanup — removing duplicate coordinates, for instance. I would leave the GeoNames IDs, because they show that the item has been merged. Okay — fixing the coordinates can also be done pretty easily in QuickStatements. Thankfully, the coordinates from GeoNames usually have a reference saying they were imported from the Cebuano Wikipedia, so you can run a SPARQL query to get a list of coordinates that are referenced to the Cebuano Wikipedia, and then use QuickStatements to replace them with the more accurate ones. So there are the screenshots of that process: running the SPARQL query, getting the results — the coordinates referenced to the Cebuano Wikipedia — then using QuickStatements to remove the inaccurate coordinate and add the new one in. I've also included a reference saying it comes from the New Zealand Gazetteer. So that's fixing it in Wikidata, which is probably the easy part. But I didn't want to just fix up Wikidata and then leave — I want to really close the loop on this, in a way, to make sure it doesn't happen again when someone uses these sites. So I actually want to also fix up the Cebuano Wikipedia for New Zealand — and potentially other countries — and GeoNames as well. So how do we do that?
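Put together, the QuickStatements side of the workflow looks roughly like this — tab-separated V1 commands (pipes also work); the QIDs and coordinates are made-up placeholders, and P625 is the coordinate-location property:

```
# 1. Remove the Cebuano sitelink from the duplicate item:
Q11111111	Scebwiki	""
# 2. Merge the duplicate into the item being kept:
MERGE	Q11111111	Q22222222
# 3. Replace the inaccurate GeoNames coordinate with the LINZ one:
-Q22222222	P625	@-45.10000/168.50000
Q22222222	P625	@-45.10350/168.50520
```

The `-` prefix removes a statement, and `@LAT/LON` is QuickStatements' coordinate literal; a source qualifier can be appended to step 3 to reference the New Zealand Gazetteer, as in the talk's screenshots.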
So I've downloaded a dump. You can download dumps from dumps.wikimedia.org — a dump of any Wikipedia: the current versions, or all revisions with all the metadata and all sorts of information, whatever you need. In this case I just downloaded the current version of the Cebuano Wikipedia. As I said, I've been running the analysis and those clustering algorithms on the GeoNames data set, but a lot of items have already been manually merged — merged in the Cebuano Wikipedia and in Wikidata. So it could be a good idea — you can query Wikidata, but you could also identify where the actual articles still exist. Another problem is that if something has been manually merged, by myself or others, we've probably just deleted the sitelink and haven't merged or redirected the article, so it's just sitting there, not connected to a Wikidata item. So you could potentially use this dump to find unattached Cebuano articles with no Wikidata item, and then merge them properly. This was a — phew — I was quite relieved; I only got this working the other day, which is good because my whole talk was kind of predicated on it: using the MediaWiki API. You can use the API for all sorts of things, but what I was trying to do was a merge — or rather, to redirect an article.
Essentially, I was replacing the content of an article about a geographic object which I'd identified as a duplicate, and redirecting it to whatever I'd decided was the core one — the source of truth, the single item I wanted to merge into. I managed to do that using OAuth. What I'll try to do is set this code up in the GitHub repository so that anyone can log in with their own account, get centrally authenticated using OAuth, and then run these bits of code to merge articles in Wikidata and in the Wikipedia. In this case I've just set up something very basic that works from my account, Canley: I request a token, it sends a token back, and then I can connect. It actually did work, which I was delighted to find — though I hadn't actually used the dump at this point; I was just using some of those examples I showed you before and merging them in the Cebuano Wikipedia. The first link is the dump; the second is Postman, which is an app for interacting with or building APIs. There are many other ways to do it — in a web browser, programmatically in Python or R, however you want. You submit a POST request, use your token to authenticate your account, and then you put in the action you want.
You use the page ID — or you can also use the title — to identify which article you want to merge. You put in a summary ('merging with duplicate'), and you put in the text, which is normally the whole article text; but because I just want a redirect, I put #REDIRECT and then the name of the article I want to redirect to. As you can see, it returns this JSON snippet saying it was a success, and it successfully merged the article — and when I look at the page history of that article, you can see I've merged it into another one. Now, this took a few minutes — I probably could have done it manually in the same time — but it means it is possible to write code to do this more rapidly: to identify the duplicates and then merge them into what you've identified as the main article (and how you decide that is another issue). There are still issues with the coordinates on the Cebuano Wikipedia, because they're from GeoNames. I guess you could fix those programmatically too — you have the dump of the article text, so you could potentially edit it — it depends how far you want to go. What I'm looking at here is a fairly simple approach — trying to keep it as simple as possible, and just showing that it is possible to programmatically (a) identify the duplicates and (b) clean them up and tidy them up in Wikidata and Wikipedia. As I said, it's quite a complicated process.
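The edit call just described can be sketched in a few lines of Python. This is an illustration, not the talk's code: the titles and token are placeholders, and a real run needs a CSRF token from an authenticated (e.g. OAuth) session:

```python
"""Build the MediaWiki API 'edit' request that turns a duplicate
article into a redirect, as described in the talk."""
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://ceb.wikipedia.org/w/api.php"  # Cebuano Wikipedia

def build_redirect_edit(title: str, target: str, csrf_token: str) -> Request:
    """Replace the article's whole text with a redirect to `target`."""
    params = {
        "action": "edit",
        "title": title,                     # or pageid=... instead
        "summary": "merging with duplicate",
        "text": f"#REDIRECT [[{target}]]",  # the entire new page text
        "token": csrf_token,                # CSRF token from meta=tokens
        "format": "json",
    }
    return Request(API_URL, data=urlencode(params).encode("utf-8"),
                   method="POST")

req = build_redirect_edit("Blue Mountain (duplicate)", "Blue Mountain",
                          "CSRF_TOKEN_HERE")
print(req.get_method(), req.full_url)
# Submitting it (urllib, requests, Postman, ...) returns JSON with
# an edit/result field of "Success" when the redirect is saved.
```

The same request works equally well from Postman or a browser form, which is all the talk's proof of concept needed.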
I'll try to set it up so it's a bit more accessible — and just to be clear, I'm working on New Zealand here, but I do want this to be usable for any country, and it doesn't even have to be geographic objects: anywhere there are bot-generated articles. Geographic places of course work particularly well with coordinates, because clustering is literally meant to find things that are near each other. But it's a pretty complicated process — how to use the API, how to get the authentication and so on — so I'll put all the instructions on the GitHub repository. I've set it up now, so you can go there, and I'll put the slides from this talk in there too. There's really not much on there at the moment, but over the next few weeks and months I'll be adding more and more — mostly, I guess, a kind of journal of what I'm doing and how I'm doing it. There'll be some data sets you can download, and some reports and SPARQL queries that you can use to see if it works for you, or if you want to contribute. I will be working on the New Zealand data set and cleaning it up over the next few weeks — you're welcome to contribute, but I should have it completed pretty soon. Okay — closing the loop, going back to GeoNames. As I said, GeoNames is an amazing data set: openly licensed, freely available, very comprehensive.
It just has a few issues, and it is possible to resolve duplicates — to have them redirected or merged — by, I think, reporting them on their discussion forum, where an administrator can action them. So what I was hoping to do was use the clustering algorithms to generate lists of duplicates: these GeoNames IDs are duplicates of each other — maybe go with the first one, the earliest, the lowest number — and post that to the forum asking an administrator to merge them. I've seen them do it, but I don't know about at this scale: we're talking probably about 2,000 rivers and 800 mountains, so maybe 3,000 items. I can point them to the GitHub repository, I suppose, and see if they want to do it — but hopefully this closes that loop once and for all. There's the GitHub repository again. As I said, I'll put the slides up, along with a few SPARQL queries I've been using, and I'll be adding to it significantly: how to use the API, how to run the queries, how to run the clustering algorithms — just in case you want to do this for issues you've identified in Wikidata, the Cebuano Wikipedia or GeoNames for your own country. Hopefully we can get the whole world cleaned up. And as I said, GeoNames and the New Zealand Gazetteer are both under Creative Commons licences, so that works very well with our wiki world. That's all I've got — we have four or five minutes. Does anyone have any questions?
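Generating that forum report from the cluster labels is straightforward. A minimal sketch — the IDs and names below are invented, and the "keep the lowest GeoNames ID" rule is the heuristic suggested above:

```python
"""Turn duplicate clusters into merge suggestions for the GeoNames
forum, keeping the lowest (earliest) GeoNames ID in each cluster."""
from collections import defaultdict

def merge_report(records, labels):
    """records: list of (geonames_id, name); labels: cluster label per
    record, with -1 meaning 'no duplicate found'."""
    clusters = defaultdict(list)
    for (gnid, name), label in zip(records, labels):
        if label != -1:
            clusters[label].append((gnid, name))
    lines = []
    for members in clusters.values():
        members.sort()                       # lowest GeoNames ID first
        keep, dupes = members[0], members[1:]
        ids = ", ".join(str(g) for g, _ in dupes)
        lines.append(f"Merge {ids} into {keep[0]} ({keep[1]})")
    return lines

recs = [(6204348, "Blue Mountain"), (2193733, "Blue Mountain"),
        (2181133, "Big Hill")]
print(merge_report(recs, [0, 0, -1]))
# → ['Merge 6204348 into 2193733 (Blue Mountain)']
```

One such line per cluster keeps the request reviewable by a GeoNames administrator, which matters given the scale involved.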
That's great stuff — it really is wonderful to see this problem being solved. So that's about 10,000 things fixed for New Zealand, and there are a couple of hundred other countries — quite a lot of work. Maybe we could send Sverker Johansson the invoice. But I was just wondering how this is going to scale, and I'm a bit concerned that GeoNames just does not have the capacity to actually merge that data — and hopefully we can stop someone else from running a bot to create those again at some point.

Yeah, that's something I'm a bit worried about too. Like I said, I've seen it happen — I've seen people post lists — but those were lists of about 20 duplicates they'd identified, and we're talking hundreds and thousands here, and as you said, that's just for New Zealand, let alone the whole world. There are a few articles and blog posts where people say the GeoNames data set is amazing but — it's kind of amazing and terrible at the same time. It's the best we've got: one of the only global, consistent gazetteers that covers the whole world, and if you want to compare one country with another within the same consistent format, it's a great way to do it. But yes, it has these problems, and you're right — it's just an assumption on my part that they can fix them and will want to fix them. They do have a paid professional subscription service, which is quite significantly expensive, and they say they run all these clean-up algorithms and that everything is carefully checked — so I think it would be in their interest to have clean data wherever possible.
Yeah, that was going to be my question — whether you'd actually talked to anyone at GeoNames to see whether they're prepared to do any of this cleanup.

No, I haven't. As I said, I was just looking —

Have you got plans to?

Yeah, I probably should — and the Cebuano Wikipedia too. Someone asked me, 'have you talked to anyone at the Cebuano Wikipedia?' and I said no, I probably should. You're right, that's a very good point. I am kind of wading into waters here, and I don't want to tread on any toes — I'm trying to be helpful — but yes, it's definitely a good idea to talk to the people involved in these projects.

On your screenshot of the redirect on Cebuano, it says below that it's in the category of redirects that have a Wikidata item — and I was surprised, because I thought you would have already cleaned up the Wikidata item.

Yes, very good point — well spotted, Denny is right. There was a category on the Cebuano article I merged saying there was a Wikidata item attached to the redirect, and that was because I had done that step first and hadn't yet redirected. I have since done it, so I think that category is now gone.

So, first of all, thanks for doing this work — I'm looking forward to starting to use this. I have two questions. First: do you have any plans to make this a web interface, something that could be easier for other people to use?

Yeah, I'd love to do a web interface or something on Toolforge.
At the moment it's all just code and I'm trying to get that working, but my aim is to have a public interface that people can use — something that will show you, 'here are the clusters, do you agree?' — just to give it that human touch.

And the second question: what you've talked about, and what your work has focused on so far, is mostly merging duplicate items. But one thing I've noticed happens a lot with the items created from the Cebuano Wikipedia is that they actually have different items for things that are normally conflated in Wikidata — for example, a municipality and the city that is the seat of that municipality. I would personally like to separate those items; some people prefer to merge them, but I think that while in Wikipedia it makes sense to have them joined, in Wikidata it might make sense to separate them. I'm not sure if you've thought about the opposite operation — splitting, based on Cebuano or some other source?

I'm aware that's a bit controversial, and I've tried to avoid it with this. But there is a whole debate there — you can see it on the GeoNames forum: lots of discussion where someone says 'these are duplicates' and someone else says 'no, that's a city and that's a municipality, and they're different'. So yes, there's a lot of debate on that. Thanks for the question.

So, I had one comment and two questions, but one question was just asked and answered, so I'll start with the comment. I'd say we still have to thank Lsjbot and its creator for unearthing this problem for us — otherwise maybe we wouldn't even be aware of it. And now the smaller question: since you've edited the Cebuano Wikipedia in an automated manner quite a lot, did you ever get any contact from them, or from other
editors from the Cebuano Wikipedia, or any other Wikipedia that you've edited with your scripts?

No, I haven't — because I've only recently worked out how to do it. I will try to get in touch with them. I haven't done any bulk editing yet; it's just a proof of concept so far.