OK, this is the session. This is just a demo; it may not take the whole half hour, but we'll see how we go. What I'm going to show you is how OpenRefine works and how you can use it, plus some things I've done with it: I've been using it heavily over the past few months to clean up some institutional data that we have.

So this is the main home page of OpenRefine. It's open source software; you can go to openrefine.org and get it there. There is an official release on the download page, but it doesn't have all the features I'm going to show you today, so you might be interested in using the latest development version, which is available from their GitHub site. If you've ever used GitHub, it's the very straightforward procedure of cloning the repository. And I have it in a little window over here.

One thing about OpenRefine that I should mention, which is different from some other tools you may use, is that it runs locally on your own computer. It's not some central service; I don't think anybody has installed it that way, though it probably could be used that way if you wanted to. It's very easy to set up on your own if you have some development tools like Git, or at least Java, installed on your system; then you can run it locally, as I'm going to do on this laptop. You do a git clone, which takes a few minutes (I'm not going to do that here), and then you just do a build with ./refine build, which takes a few seconds.

And I'm actually going to run it in another window, which is lost here; windows moved around. Here it is. OK. To run it, all you have to do is ./refine in the directory where it was built, and it will start the server and bring up a web browser pointed at the OpenRefine interface. So the interface itself is a web interface, and what you see here are some projects that I've already been working on.
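For reference, the whole setup is just a few commands. This is a hedged sketch assuming you have Git and a Java JDK installed; the repository URL and script names below are the standard ones, but check the project's README for your version:

```shell
# Clone the OpenRefine source (takes a few minutes)
git clone https://github.com/OpenRefine/OpenRefine.git
cd OpenRefine

# Build it (takes a few seconds once dependencies are downloaded)
./refine build

# Start the server; it opens a browser pointed at the local
# web interface, by default http://127.0.0.1:3333/
./refine
```

Everything runs locally; the browser is just talking to the server process on your own machine.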
One of them is a copy of what I'm going to demo to you today; it says demo test on it. But I'm going to actually start from the beginning, which is just creating a project. You just have to find the file, which is this demo accounts file, right? You open the file, and it shows what's in the file. It can take all sorts of different formats; this is just a plain CSV file. What it has in it is a number of institutions in our data, with an account ID (an internal ID of some sort), name, display name (a variant of the name), address, and city. So this is typical data that you might have, whether it's institutions or people or something else that you're trying to clean up and link to things.

So this is the project creation page. It gives you a chance to fix anything wrong with the way the file was parsed, but everything worked perfectly here, so we're just going to create the project. And, oh no, update the preview first; then create the project, yeah. And there it is. It shows that same information in project form. You'll notice it has the data over here, and then some things over here for filtering, and also an undo/redo log; we'll look at that in a minute. Basically, it keeps track of everything you do to your data, so if you do something that you don't like, you can step back and go back to the way it was earlier. You can also export that log of changes so you can apply the same changes to other data, if that is something you need to do.

All right, what we're going to do here is very basic reconciliation, which is actually very nice because built into OpenRefine is a Wikidata reconciliation piece. What that's doing is looking at the names of the things in my list here and finding what kinds of things they are in Wikidata. It can tell these are institutions of some sort: some universities, high schools, libraries.
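That exportable log of changes is plain JSON (in the Undo/Redo tab, "Extract..."). As a rough sketch of what one entry looks like; the exact field names vary by operation and OpenRefine version, so treat this as illustrative:

```json
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column country",
    "engineConfig": { "mode": "row-based", "facets": [] },
    "columnName": "country",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]
```

You can paste a list like this into another project's "Apply..." dialog to replay the same steps on new data.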
Instead of reconciling against just one of those types, I'm going to use the general organization type (not the organic chemistry organization that also comes up), and we're going to reconcile them. How long this takes depends, of course, on how much data you have. If you have thousands of entries, it may take an hour; with a dozen, it just takes a few seconds. And you'll see that it's linked up most of them with Wikidata records. In fact, on the left you'll see it's created two filters, or facets. One is the judgment facet, which is a little covered up on my screen here, but you'll see it says matched. Actually, it didn't match most of them: it matched five of them automatically, and seven are not matched. We can look at the seven that are not matched if I just click here, and then we can work on figuring out why they didn't match and finding a Wikidata match for them.

The first one here is a Max Planck thing, and it's got a candidate match, which is a library. But we actually want to match to the real institution, not to the library. So I'm going to click here to bring up the link from Wikidata, and you'll see it says it's part of this thing, so we'll just take the QID here. You can either match every cell that has that same name, or just match the one cell. I typically just match the one cell, because I know my data usually has just one record per institution; but if you have many records for the same institution, you can do them all at once here if they have the same label. So that's going to match that, and now we have the Max Planck Institute linked in that case.

There are some other examples here. Let's see; actually, I'm going to cancel this right now. We're going to cut and paste. This one is an institute, and the name didn't match partly because of extra bits in the name that probably aren't part of the official name. Let's see, I think it was mechanics, right? OK, we want the institute.
And there it is. You have to know something about the thing that you're trying to link to, but if you can find it, you can match it; you can see that with just a piece of the name in Wikidata. OK, there are some other ones. Sometimes you're not going to be able to find enough in your data to match things. For the Logan High School here, my data set has absolutely nothing else, no other information about it, just the name, and there are plenty of them in Wikidata, so it's obviously not going to match.

Let's see: University of Latvia. This is an odd one. The reason it didn't quite match up is that the name in my data says "the University of Latvia", while Wikidata has just "University of Latvia". It's clearly the right thing, so we're just going to link that up; you can click a checkbox to link it directly.

One of the wonderful things about Wikidata, of course, is the multilinguality. So we can look up this Polish name, and you'll find that there it is, the Institute of Mathematics of the Polish Academy of Sciences, the first result. That's clearly the one we want, so we can link that up.

All right, so we've linked most of them, and I could do the other ones too. The Fraunhofer institute here is just misspelled in our records; there should be an extra R in that name. So you can do a search on Wikidata to find that it's a Fraunhofer Institute. Anyway, those are the ways you can match.

The interesting thing now is to go look at the ones that are matched. We're going to look at all the matched ones. OpenRefine actually comes with a facility to add data to your data set based on Wikidata information. So I can go here, again, to these names that have been reconciled. I'm going to remove the Logan High School one as well. Wait, I don't know that one; wait a second. Oh yeah, I'm going to include these ones. Here we go; I can't quite read my screen here.
OK, we're going to add columns based on reconciled data. We can do that with Edit column, "Add columns based on reconciled data". It comes up with a list of typical Wikidata properties that are associated with the institutions in my list here, or maybe with these types of Wikidata entities. So let's add headquarters location, if we have that. We can also add "located in the administrative territorial entity"; some of them have that instead of headquarters location. And it shows up. You can add multiple columns at once. You can also add things based on the labels in Wikidata. P17 is country, and you can actually follow paths: P17/P298 gives the three-letter country code, which is the same code my data set was normally using, so we can add these. If there are external identifiers or other things you want to add, you can just add them to the list, and it will pull those in as new columns in my data set.

So I can export this data and compare things, but I can also compare things right here. I can do a comparison between this country code and the country code that was in my original data and see if they match. Sorry: add a facet, a custom facet. This has a whole expression language, which is pretty similar to some programming languages, and it has a help that shows you how to do this. Here I'm just doing a comparison between the value in that column and the value in the country column, which was the original one. And it shows three of them did not match.

So why didn't they match? Three different reasons. One is that even though our data set was supposedly using the three-letter country codes, for the United States they put in "United States" instead of USA. The second is that somebody seems to be very confused about the country code for Switzerland; that's the code for Swaziland, I think.
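The custom facet expression is along these lines, written in OpenRefine's GREL expression language. This is a sketch that assumes the original column is named "country" and the new Wikidata-derived column came out as "P17/P298"; your column names may differ:

```
if(cells["country"].value == cells["P17/P298"].value, "match", "mismatch")
```

Added as a custom facet, this buckets every row into "match" or "mismatch", which is how the three bad rows show up on the left.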
And the third one is that there seems to be some conflict: somehow it has University of Latvia, but it also has it in Warsaw, Poland. So that might be something to look into in the data and see what's going on there. One of the nice things about OpenRefine, though, is that you can edit your own data as well and fix up problems like this. I can just change "United States" to USA, apply it to all the data in my spreadsheet, and then that problem goes away.

So that is actually pretty much everything I wanted to show. I'm going to remove these filters, go back to the original data here, and see if you have any questions. That's it; it's very easy. There's a lot more stuff. I should mention that Antonin Delpeuch, a Wikidata contributor who couldn't be here, has worked a lot on this, and he's actually working on an export to QuickStatements. So you could take data like this, run it through OpenRefine, link it to Wikidata items, and then export updates. For example, if you had an ID in here that wasn't in Wikidata, you could export that as a QuickStatements list and update Wikidata almost directly from OpenRefine. So it's very nicely integrated.

In the meantime, I've got a question for you in the room: who here already uses OpenRefine, or has at least heard about it? Yeah, not bad. Good.

I have a question: when you are searching for the name, you have there just labels and no descriptions, so it's sometimes very hard. For example, if you choose "new match", you can try; yeah, for example, this one here. You can write there, for example, "PROC", and you will see what the problem is.

Yeah, it doesn't show descriptions; no, that is true. That would probably be nice. So that depends on the Wikidata reconciliation interface for OpenRefine that this is using. I'm not sure who wrote that; anybody in this room?
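Going back to that "United States" fix: instead of editing cells one at a time, the same change can be done as a bulk transform (Edit cells, then Transform, on the country column) with a GREL expression like this; a sketch, so adjust the strings to your own data:

```
if(value == "United States", "USA", value)
```

Every cell in the column that says "United States" becomes USA in one step, and the change is recorded in the undo/redo log like everything else.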
But that's an API that obviously could be modified to do better on that kind of thing, I guess. And if you have another data source or another Wikibase instance, you can write a reconciliation interface for that and link to it instead of Wikidata.

So regarding this reconciliation: I see it works quite well for names, but will it work with other properties? For example, I have a database of some stuff I want to upload to Wikidata, and it looks perfect for matching, say, the IDs that we currently have on Wikidata from some database against our database, so we can quickly remove all the data we already have on Wikidata and make sure we don't create any duplicates, right?

Yes. One of the things I forgot to show you is in the reconcile dialog. What I've done here is step back to before I did the reconciliation; when you bring this up, you can actually add additional properties. You see the other things in here I could have added; I think I can add P17/P298 here, for example, and it would try to reconcile that against the country code, so that it only matches where that matches too. Or if I had another ID that is also an ID in Wikidata, it would try to match that ID directly to the Wikidata records. So you can enhance the matching process and adjust it a bit. Anyway, it's worth using.

All right. Any more questions? You're sure? Last word? One, two, three. Catch me later if you need to. All right. Thank you.
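Under the hood, the property-augmented reconciliation described in that last answer sends the service a batch of queries shaped roughly like this. This is a sketch following the reconciliation API conventions; Q43229 is Wikidata's organization type, and the property path and value here are just the ones from this demo:

```json
{
  "q0": {
    "query": "University of Latvia",
    "type": "Q43229",
    "properties": [
      { "pid": "P17/P298", "v": "LVA" }
    ]
  }
}
```

The service scores candidates on the name and on how well the extra properties agree, which is what lets an exact ID match dominate a fuzzy name match.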