Good morning. My name is Sarah Timer, and I am the description and discovery strategy librarian at the University of New Hampshire. Today I am going to be talking about our project to use Wikidata as linked data to enhance our collection analysis.

The project had two goals. The first was to determine which Wikidata properties were coded consistently enough to be of use, if any of them were. The second was, if there were some with a considerable amount of coding, what would that tell us about our test collection?

So, the characteristics of the test collection. We just wanted to concentrate on text to begin with; we did not want to look at the multimedia, the audio and video. And we wanted the test collection to be of a decent size. We didn't want a gigantic one that was going to take up too much time, and we didn't want one that was too small. We wanted one that would give us a significant amount of results. So we chose the philosophy collection, based on call number, and we divided that collection up into four subgroups: books not in English; books that are in English now but were not originally in English, which would be the translations; the physical books that are in English; and everything electronic, which just got put in one category. We 100% realized that we could have divided up electronic into English, not English, and translations, but we thought we could do that at a later date.

We also decided to concentrate just on the names that were in the author field, that's the 100 field, and in the subject field, that's the 600 field. We did not include names that are in the 700 field, so we're not looking at the translators or additional authors or any of the other people who were in the 700 field. We're not dealing with that in this project.

So there are a lot of Wikidata properties that describe people, but we decided just to concentrate on a few.
For the sake of this project, we just concentrated on sex or gender, native language, country of citizenship, ethnic group, and religion or worldview. It could be that in the future we identify others and then experiment with those, but we did not do that at this point in time.

So the basic overview of our process was: after we created the sets, I imported the data into OpenRefine. In OpenRefine I reconciled the names using Wikidata, and once they were reconciled, I added columns based on the reconciled values.

I have a slide where we are going to quickly go over what that process looks like. These are 14 names that came out of one of the philosophy collections; I didn't want to show the whole sheet because I thought that was going to take up too much time. The first thing we're going to do is just reconcile those names, so you go down to Reconcile, Start reconciling. You need to pick Wikidata, tell it to start reconciling, and it'll reconcile those names. It's going to go pretty quickly because there weren't that many names.

So now you can see that some reconciled automatically, and for others it didn't know exactly who to reconcile to, so there's going to be some manual work involved in this process. Here the first name offered is a philosopher and the second one is a dentist, so you just tell it to reconcile on the first name. It's going to be similar all the way down: the first one is a philosopher and the ones after it are less certain. It looks pretty sure that the first name you've got is truly the person that you want, because they are all listed as philosophers, and this is the philosophy collection, so it's probably pretty safe picking the philosopher as the person that you want to reconcile to.
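The manual rule described above (accept an automatic match, otherwise take the top suggestion only when it is plausibly a philosopher) can be written down as a small sketch. This is only an illustration of the decision rule, not how OpenRefine works internally: the candidate dictionaries, their keys (`id`, `description`, `match`), the placeholder identifiers, and the function name `pick_candidate` are all assumptions made for this example.

```python
from __future__ import annotations

from typing import Optional


def pick_candidate(candidates: list[dict], keyword: str = "philosopher") -> Optional[dict]:
    """Return the candidate to reconcile to, or None to leave the name unreconciled.

    `candidates` is a hypothetical list of suggestions for one name,
    assumed to be ordered best-first.
    """
    if not candidates:
        return None
    # A suggestion flagged as an automatic match is accepted as-is.
    for cand in candidates:
        if cand.get("match"):
            return cand
    # Otherwise accept the top suggestion only if its description
    # fits the collection (here: mentions the keyword).
    top = candidates[0]
    if keyword in top.get("description", "").lower():
        return top
    # No quick, obvious match: skip the name rather than spend time on it.
    return None


# Hypothetical suggestions for one name; the ids are placeholders, not real Wikidata ids.
suggestions = [
    {"id": "Q_PLACEHOLDER_1", "description": "German philosopher", "match": False},
    {"id": "Q_PLACEHOLDER_2", "description": "American dentist", "match": False},
]
chosen = pick_candidate(suggestions)
print(chosen["id"] if chosen else "unreconciled")
```

Returning `None` for anything ambiguous mirrors the choice in the talk to leave such names unreconciled instead of researching them one by one.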
So once everybody has been reconciled, you get to create more columns based on those reconciled names. You go down to Edit column, create columns, and now you choose which properties from Wikidata you want to add. We're going to start with sex or gender, and you can see you get the preview of what that's going to look like. You can see even from this view that sex or gender was a very commonly coded field. Now we are adding other fields that we think could be interesting: we added ethnic group, we added native language. You can see from that list of suggested properties on the left that there are a lot of properties that I did not even try. Not that they would tell me that much that I want to know for this project, but just in general there's a lot of information about people that you could get from Wikidata, if that information is coded. After I finish adding all the properties that I want to experiment with, I just click on OK and those fields get added. So now you can see what that looks like. You can see that Martin Buber has a lot of languages spoken, so it is possible to have multiple values in the same field. But basically that's what the process looks like; that's what I wanted you to get out of that slide.

So how frequently were names actually reconciled? Here, when I use the word "reconciled," that just means that they were names that we could identify a quick match for. If there were no matches, or there were too many matches and it wasn't obvious very quickly who the right match was, we did not try to spend time on those names, because there were just too many and we did not have enough time. So it certainly could be that the name is in Wikidata and we just did not spend the time to find it. But you can see, even just from our quick manual effort and the automated reconciliation, that 97% of the authors for translations were reconciled, and even in the electronic it's at 66%, so
it's a pretty significant difference, but still it's more than half.

As far as the electronic collection goes, I want to make sure to say that in our collection policy we are now preferring electronic, so any of our new titles are going to be electronic, and that has been the case for more than five years. So the physical titles tend to be older; anything that's new is going to be electronic. Of course there are also classic old titles that are electronic too, but definitely anything that's new is going to be electronic.

So you can see there, as far as author reconciliation, we're at a high of 97 and a low of 66, with an average of about 75. As far as the subject names go, that was pretty consistent across the board, and they were all at 99%, so an overwhelming number of subject names actually were reconciled. Which makes sense: if you think that somebody wrote a published monograph about this person, this person then has a record in Wikidata. Yes, that makes sense. And it could be that if we were truly trying to get that up to 100%, if we went in manually, I wonder if we couldn't find the Wikidata record for that other, you know, tenth of a percent.

So how about the frequency of Wikidata properties for the authors? You can see that sex or gender was the most populated field for authors, and it was very well populated: we have the low of 95% for electronic and the high of 99% for translations into English. The second most populated field was citizenship or nationality, which also had a low of 65 for the electronic but, you know, 75, 90, 93 for anything that's physical. Much less frequently populated were those other three categories: the religion, the native language, and the ethnic group, with the ethnic group for electronic down at around four percent, and for physical English it's down at five percent. So not a commonly coded Wikidata property, but sex or gender and citizenship or nationality are pretty commonly coded.

So what does that look like for subjects?
For translations, sex or gender was at 100%, and it's over 90% for electronic and physical English. It's at 83% for not-English-language material. I think that's very strange, and I started to look into why that is, but I still need to do further analysis there. I didn't have time to go back and truly look more carefully at why non-English-language authors don't have the sex or gender coded. They all are pretty high with citizenship or nationality; that's definitely the second most commonly coded property. And with the religion or worldview, native language, and ethnic group, all are less than 50%, and down to 2% for ethnic group in non-English-language material. So much less frequently coded, and I'm not going to be analyzing that data right now because it's just not frequently populated enough.

So with sex or gender, what is that going to tell us about our authors and subjects in the philosophy collection? It tells you what you thought it was going to tell you: that between 91 and 95 percent of the authors are male, with only one person in the entire collection coded as something other than male or female. And even as subjects, it's a very high number of men. So most of the philosophy texts, either physical or electronic, are written by men and/or written about men. That's not real news, but now we have numbers to actually demonstrate that fact.

So how about their nationality or citizenship? Well, we have not finished analyzing that part of the data, in part because nationalities are a little bit more complicated than sex or gender: country names have changed, the borders have changed, and people can have multiple citizenships. It seems like, especially during World War Two, there were people who had German citizenship and American citizenship, so people are going to be going in multiple categories, and we needed to think about how we wanted to indicate that. So what we started to do was
to assign these people, the authors and subjects, to continents instead of to countries. I think that's going to work out well. I also think that, from the numbers that we have right now, it's going to be sort of as you might predict: the same way that most of the authors and subjects were men, most of the authors and subjects are going to be either from Europe or from the United States. It's going to be close to 90 percent that are US or Europe, with Asia, Africa, and all the other continents a long way behind. It's going to be dramatic when we see it, but I just don't have the numbers yet to show it to you.

So what are our issues and next steps? We have those names that were unreconciled. Would anything change if we were able to reconcile them? How many of them truly have a record in Wikidata? And for the people that don't have a record in Wikidata, is there a common trait among them, and should we be adding those names in? Because it could be that the reason they're not in Wikidata is that they're a member of an underrepresented group, and so maybe that's an indication that that's something we need to do.

Our next steps, however, are going to be to finish analyzing that geographic data, and that shouldn't take that long now that we're going to take the continent approach. We are also going to create a Primo view to use to create facets, to see whether that view is going to help with collection analysis for our collection strategy librarian and our subject librarians, to see whether they're going to find that to be a useful tool for them.

And this is what that's going to look like. There we go. This is a Primo view; we're an Alma Primo library. You can see on the right side of your screen, under the filter results, there is the creator's country of citizenship right there. There's the sex or gender of the creator; you can see there's the one transgender female that got coded. You can see even
farther down there's the author's ethnic group that I threw in, and I think I'm going to take it out because it's not a frequently enough coded field. I don't want people to get confused by that, so I need to delete that. But this is what that view would look like. And here, if you had the sex, you could divide it by resource type or location or library, or you could even see whether it was electronic or physical. There are a lot of things that you could do, but we're going to need to look further into that to see whether that's something that would be of interest to other people.

So that is where our project is right now. I'm going to stop the share. Here's my contact information if you would like to get in contact with me. The project is continuing, and we might be looking at some additional properties to see whether there are properties that would do us more good than the properties we chose. If you have any questions, just let me know, and thank you.