Hello everyone. My proposal is to use Wikidata as a universal library thesaurus, mainly for name authorities, and I think that even in this environment this is quite provocative. It is a proposal on my own behalf; I must stress that it is certainly not the formal policy of the National Library of the Netherlands. This is an overview of my talk. I will first explain the idea and the motivation. Then I will show how we use Wikidata in our historical newspapers. After that I will discuss different aspects and different ways of using it. Then some pros and cons — well, not pros and cons, but pros and objections, objections that I have heard from other people. And then I will give some conclusions.

Now, first: why do we use this? Well, we use it for the unique identification of entities — authority control. We use it to provide context information about an entity. And we can use it to enable semantic search based on the entities' properties. And why would we link to Wikidata? Well, Wikidata can serve as a knowledge base, it is linked to a lot of other knowledge databases, and it links identifiers to each other. So why not combine that? So the idea is to adopt Wikidata as a universal thesaurus. The reason is that libraries and other institutions from different disciplines increasingly want to get connected to each other, and connecting to a central hub is more efficient than connecting everything to everything. Using the same identifier is even more efficient than linking resources, and besides that, using the Wikidata identifier is much more efficient than inventing yet another identifier. Libraries might also want to share the responsibility for a common thesaurus, and in this case Wikidata is that common thesaurus. So we need to create trusted links, and this is the current situation.
In the left corner you see a small view of a bibliographic record from the National Library of the Netherlands. There is a link from the record to the thesaurus, the thesaurus links to VIAF, VIAF links to ISNI, and ISNI links to Wikidata. That way we can get the link to Wikidata into our records. But in many situations we don't have those links and they have to be created, which is quite difficult, especially in the case of text. I will come back to that later when I talk about the historical newspapers.

So this is the current situation, where the library catalogue records link to the thesaurus, and the thesaurus can link to Wikidata and to other thesauri. The first step, which will already be a lot better, is to have Wikidata as a central hub, so that everything is connected through Wikidata. And this is the situation I want to propose: Wikidata used directly as the thesaurus. Here you see a few examples of how, in this case, Einstein as a creator can be identified in different bibliographic records from different libraries — the National Library of the Netherlands, the British Library, the BnF, the Library of Congress — and they all use different identifiers for the same entity. The proposal is, in fact, to bring the Wikidata identifier to the surface and to index those Wikidata identifiers so that you can search for them. Bringing them to the surface means that they don't have to be displayed to the user, but they have to be available in such a way that users can, for example, use them in microformats and can search for the identifier. So this is an overview of what I still consider Mission Impossible: we have different national libraries, and in the situation on the left they each link to their own thesaurus, while in the situation on the right they all link to Wikidata directly. The extra motivation to do that is that it lowers barriers.
Instead of having a number of locally unique identifiers, we have one globally unique identifier. We minimize the number of hubs needed to link from one identifier to the other, we minimize the number of variations of identical queries across different databases, and we minimize the knowledge required to access those different databases or catalogues. Sharing a global identifier makes it easier for institutions to connect without dramatically changing their infrastructure, and I will come back to that later. In general we standardise everything — protocols, metadata formats, everything — so why not the resource identifier?

Now I come to what was actually the motivating use case at the KB, the Koninklijke Bibliotheek: the historical newspapers. On the left you see DBpedia, Wikidata and VIAF, and we have indexed the relevant properties that we need for linking in a Solr index. On the other side you have the newspaper article, and we have an enrichment process which takes the article, does named-entity recognition, gets the entities, searches for the entities in that Solr database with all the relevant properties, and then finds the best candidates. We have a disambiguation process for this which makes use of machine-learning techniques, and we store the resource identifiers of each entity, together with the article identifier, in an enrichment database. This is the infrastructure as we use it in the National Library, and we make use of a high-performance computing cloud for the time-consuming disambiguation process.
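The enrichment loop described above could be sketched as follows. This is a minimal illustration, not the KB's actual code: the candidate index contents, the scores, the article identifier, and all function names are made up, and the real system uses machine-learning disambiguation rather than a simple score comparison.

```python
# Hypothetical sketch of the newspaper enrichment pipeline:
# NER -> candidate lookup in a Solr-like index -> disambiguation -> stored link.

def ner(text):
    # Stand-in for a real named-entity recognizer: here we just
    # match a fixed list of known surface forms.
    known = ["Einstein", "Elizabeth Taylor"]
    return [name for name in known if name in text]

# Stand-in for the Solr index of Wikidata/DBpedia/VIAF properties.
# Q937 is Albert Einstein's real QID; the scores are invented.
CANDIDATE_INDEX = {
    "Einstein": [
        {"qid": "Q937", "label": "Albert Einstein", "score": 0.95},
        {"qid": "Q999999", "label": "another Einstein (illustrative)", "score": 0.40},
    ],
}

def disambiguate(entity, context):
    # The real system uses machine-learning features over the context;
    # here we simply take the highest-scoring candidate.
    candidates = CANDIDATE_INDEX.get(entity, [])
    return max(candidates, key=lambda c: c["score"]) if candidates else None

def enrich(article_id, text, algorithm="baseline-0.1"):
    # Store the resolved identifiers together with the article id and the
    # algorithm name, so enrichments from old algorithm versions can be
    # recognized and replaced on the next pass.
    enrichments = []
    for entity in ner(text):
        best = disambiguate(entity, text)
        if best:
            enrichments.append({
                "article": article_id,
                "entity": entity,
                "wikidata": best["qid"],
                "algorithm": algorithm,
            })
    return enrichments

# Illustrative article identifier, not a real one.
result = enrich("article:0001", "Einstein visited Leiden in 1920.")
```

Recording the algorithm name with each enrichment is what makes the continuous-improvement loop described in the talk possible: a newer algorithm can overwrite links tagged with an older name as it passes through the corpus again.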
Because it takes a long time to process all the articles, what we do is keep improving the disambiguation algorithm, and instead of restarting from the beginning each time, we simply keep looping through all the newspaper articles and, when we reach the end, start again at the beginning. The result is that all articles have been enriched, but with different quality, or different confidence. We can recognize each enrichment because the name of the algorithm that was used is stored as part of it, so we know when a linked entity is the result of an old algorithm. Here you see the steps we make in improving the algorithm; the details are not really relevant now, but it is interesting to see that we are constantly improving the confidence level of the enrichments.

Now, how do we use it? We harvest all the articles, then we take all the enrichments that we have stored in the enrichment database, and we index them together in a new index that we use for the newspapers. In that index we index the text, but also, for example, the Wikidata identifiers. Now a user can do a search, and in some cases the search is sent to Wikidata as a SPARQL query. The results are Wikidata identifiers, and because we indexed all the Wikidata identifiers, they can be searched in that index. Here is the result of a SPARQL query entered in a research portal that we use: a search for all articles mentioning members of parliament that were not born in the Netherlands. Below it you see the SPARQL query that belongs to it. With a small change — and even that small change I consider as raising the barrier to use it — we can do almost the same query in the library catalogue: all books that have as author, or as subject, a member of parliament that was not born in the Netherlands. Of course we cannot ask users to enter a SPARQL query themselves; that's impossible, I think everybody agrees.
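The parliament query can be sketched roughly like this. The SPARQL below is my own reconstruction of the idea, using real Wikidata properties (P39 "position held", P19 "place of birth", P17 "country") and the real QID Q55 for the Netherlands; the QID for the parliamentary position and the example result QIDs are placeholders, and the Solr field name `wikidata_id` is an assumption, not the portal's actual schema.

```python
# Reconstruction of "all members of parliament not born in the Netherlands"
# as a SPARQL query, plus a helper that turns the resulting QIDs into a
# filter query on a local index where Wikidata identifiers are indexed.

SPARQL = """
SELECT ?person WHERE {
  ?person wdt:P39 wd:Q000000 .   # position held: member of parliament (placeholder QID)
  ?person wdt:P19 ?birthplace .  # place of birth
  ?birthplace wdt:P17 ?country . # country of the birthplace
  FILTER(?country != wd:Q55)     # Q55 = the Netherlands
}
"""

def solr_filter(qids, field="wikidata_id"):
    # Turn the QIDs returned by the SPARQL endpoint into a Solr-style
    # filter clause over the locally indexed identifier field.
    return f"{field}:({' OR '.join(qids)})"

# Illustrative QIDs, as if returned by the endpoint:
fq = solr_filter(["Q1512564", "Q2801255"])
```

This is the second deployment pattern from the talk: no local SPARQL endpoint is needed, because the SPARQL runs against Wikidata and only the identifiers travel back into the conventional search infrastructure.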
What we do is offer the option to enter a query term between square brackets, and then the software behind it makes a best guess. In this case the newspapers are Dutch, so the query is in Dutch too, but it means "action movie" and "Roman emperor". You can also enter "Roman emperor" in English, and the software will find the best interpretation for it; in this case it's quite simple, it just searches for all the Roman emperors. If you do that on its own, without "action movie", you will get, I think, half a million articles that contain one or more Roman emperors. Note that there are a number of collections, and only in the "newspapers plus" collection have we indexed the Wikidata identifiers, so if you are going to try it yourself, choose that collection.

This is a list of search results; we select one, click on it, and the article is presented here. What you see on top are all the named entities that are linked, to Wikidata in most cases; if an entity is not in Wikidata, it is linked only to Wikipedia or DBpedia. You can click on an entity to get context information, and then you see a picture and the option to get more context information from different resource databases like DBpedia, VIAF, etc.
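The square-bracket syntax could be parsed along these lines. This sketches only the parsing step — the "best guess" resolution of the bracketed term to Wikidata entities is not shown, and the function name is my own invention.

```python
import re

def parse_query(q):
    # Split a query like "[Roman emperor] action movie" into concept
    # terms (to be resolved against Wikidata) and plain free-text terms.
    concepts = re.findall(r"\[([^\]]+)\]", q)
    free_text = re.sub(r"\[[^\]]+\]", " ", q).split()
    return concepts, free_text

concepts, text = parse_query("[Roman emperor] action movie")
```

The concept terms would then be expanded to a set of Wikidata identifiers and combined with the free-text terms in the underlying index query, which is what keeps users away from writing SPARQL directly.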
But you can also ask directly for more information, and that information is obtained from Wikidata. You then see a number of properties, and in this case the red arrow points at the field "spouse": Elizabeth Taylor. By clicking you can search all the articles on that same property: if you click on the value "Elizabeth Taylor" you find articles about Elizabeth Taylor, but if you click on the field name "spouse" you find all articles about everyone who has been married to Elizabeth Taylor. This is the resulting list; at the end I can give some more examples in a live demo.

I will now say something about the different approaches, environments and coverage. Not every institution and not every user has the same infrastructure available, or the resources for all kinds of advanced things. In the first case, shown at the top, you have a SPARQL endpoint for your local library, and then you can do a federated query. In the second case you don't have a SPARQL endpoint, but you have indexed the Wikidata identifiers in your normal existing infrastructure; then you can do the same as I showed with the newspaper articles: retrieve the Wikidata identifiers by means of a SPARQL query and use them to search your local environment where you have indexed them. In the third situation you use the Wikidata identifiers in your local library catalogue records, and when a record is displayed you can use the identifier to create a link into other libraries that support the same ID. So when other libraries also use the Wikidata identifier for the identification of entities, you can use that same identifier to create a query there.

It is also possible to have different mixtures of identifiers. In the beginning — at least, we at the Royal Library — have a local thesaurus, and Wikidata has external links to our thesaurus for some items. Sorry, I said it wrong: some items in the local thesaurus can contain a link to Wikidata. That is not the case now, but it is the next step. You can also have items in bibliographic records that link either to Wikidata or to the local thesaurus — you can mix them — and in the end all items in bibliographic records will link to Wikidata. The idea is not to get rid of the local thesaurus completely; it can be kept for administrative purposes, as long as the Wikidata identifier is on the surface and usable for external parties.

This is the current coverage. You see here the newspaper entities, the KB thesaurus — the thesaurus of the National Library — and Wikidata. At this moment only a fraction of the entities in the thesaurus is linked to Wikidata, or Wikidata linked to the thesaurus, and we have ten times as many named entities in the newspapers that link to Wikidata. That was a simplified overview; in fact it is more fragmented: we have more collections, and different combinations of items that exist in one, two or all three collections.

So this is the path that I want to propose. In the current situation we have a number of Wikidata records that link to our thesaurus, at this moment about 5%. What we can do is use that 5% to have the bibliographic records link to Wikidata directly. For the other 95% you still have to link to the local thesaurus, but by a match-and-merge process we can try to get to a level of 100%, and in the end all bibliographic records can link to Wikidata directly, with the local thesaurus only kept for administrative purposes.

Here the slide says "pros and cons"; that should be "pros and objections". The advantages of using Wikidata as a thesaurus are that it lowers the barrier to connecting different databases, it doesn't always require an advanced infrastructure to benefit from it, and it means less maintenance because it is
maintained by Wikidata and its volunteers; it also covers more domains and has a potentially richer set of properties available. The objections are these. One is that libraries perceive it as losing control when unauthorized users may change items. There is a danger of vandalism. Different organizations or countries may have a different view on a specific item for political reasons and the like. And there is a risk of duplicate entries for new items when you share the responsibility for Wikidata. Sometimes I also hear: well, what happens if Wikidata disappears? The answer is simple: we still have the identifiers, and that is what it was all about, for a large part.

So, summarizing: Wikidata can, in my view, serve as a universal library thesaurus. Using Wikidata as a single universal thesaurus facilitates the identification of entities across organizations, and we can share properties. Replacing the thesaurus identifiers in bibliographic records can be done gradually; it doesn't have to happen in one go. When the transition is complete you can still keep the local thesaurus for administrative purposes, but not for bringing the identifier to the surface: external parties do not need the link to the local thesaurus. And the use of Wikidata identifiers is not restricted to SPARQL queries; the identifiers can also be used and indexed in conventional queries. Well, I have used all my time, but I can show a live demo; let me first take questions. Just one or two questions.

"Excuse me — sometimes the authorities are used as references for Wikidata or Wikipedia. If you do that, who, in the end, will be the authority?" In the beginning I talked about shared responsibility: libraries share the responsibility and form a shared authority organization. The role of the librarians should be redefined in order to keep control of the responsibility for the actual information, but not the tool. As I said in the beginning, it's quite provocative.

A last question. "As I understand your proposal, Wikidata would possibly only be involved in the indexing phase. For an end-user-facing service I really have a problem relying on centralized services; you have to have a very good service level. Do you see any problems with that service level if we followed through with this? What would the implications be for the user if, for instance, Wikidata doesn't disappear but has a problem, an outage of several hours, or something like that?" It's mainly about the concept. The technical infrastructure behind it doesn't have to be one central index: it can be copied, there can be mirrors, it doesn't have to be in one place. It's mainly about having one type of query in a distributed environment. If one server fails, there are other servers, and you can also do it locally — you can have a local copy of Wikidata. But it's mainly about the identification; that's what it's about.

"One quick question: have you been in contact with other national libraries that have the same idea? Because I think that would be interesting." I thought I would first try it out here, because national libraries are not very eager to adopt such an idea — not yet. "That was not really a question — sorry, no more questions, you will have to discuss it afterwards. Sorry, and thanks. Look at the slide — there we are, on the slide. Thank you."