All right, good morning everybody. My name is Dario, I run research at the Wikimedia Foundation, and I'd like to welcome you to this brown bag talk about citations in Wikipedia. I'm thrilled to have an extraordinary guest today, Geoffrey Bilder; he's Director of Strategic Initiatives at Crossref. For those of you who don't know Crossref, I'm sure Geoff is going to give you plenty of context on this, but if you see those meaningless sequences of characters you click on when you want to access a paper, you can blame this guy. So we're going to talk about the interface between Wikimedia and Crossref, and about citations and metadata. We're going to have two presentations. A few logistics: there's a live stream you can follow on YouTube, and for live conversation you can join the IRC channel, #wikimedia-office, where Abby will be moderating. So, two presentations, and a discussion at the end. And with that, I'll hand over to Geoffrey Bilder.

Okay, thanks. So I should probably start by explaining a little bit about that character set that Dario mentioned, and also a little bit about myself. Dario mentioned I'm the Director of Strategic Initiatives at Crossref, and that's really a grandiose way of saying I get to do stuff. I'm kind of the R&D director there. And a bunch of things have come out of Crossref, some of which you might be familiar with: things like CrossCheck, which is a plagiarism-checking system that the scholarly publishing industry funds; FundRef, funder identifiers, which are used heavily for trying to identify research that comes out of particular funding agencies. And the thing that you may be seeing a lot more of is ORCID, which actually also came out of Crossref and my group, and then spun out as a separate organization. So these are the kinds of things that we work on.
Largely, I think if you were to characterize the stuff that we work on, it's infrastructure: scholarly infrastructure. And one of the things that is characteristic of infrastructure is that when it's working, you don't notice it. It's only when it breaks that you actually notice it; it's only when things are incompatible that you notice it. So with DOIs, to the degree that you don't notice them, consider that a success. Having said that, that leaves us in a kind of difficult situation, which is that we often have to explain a little bit about what they do, because it's not immediately obvious. So some background on Crossref: we're a membership organization with about 5,000 members. We're a non-profit. Generally, our members are scholarly publishers, but we actually define publishers with a small p; that is, it's not necessary that they be professional publishers. We've got lots of members who, for instance, have publishing operations but don't consider themselves to be publishers: places like the World Bank, the OECD, the International Monetary Fund. We don't discriminate on business models. We support all content types, all disciplines. And we also have a lot of library affiliates and the like in Crossref as well. Our major claim to fame is the DOI. DOI stands for Digital Object Identifier. It's the identifier that's digital; it's not necessarily the object being identified that needs to be digital. And these things, as Dario mentioned, are kind of weird, opaque strings, but they have a particular function in scholarly communication. And it's this: in the late 1990s and early 2000s, scholarly publishers were putting content online.
They realized immediately that, of course, one of the most useful things for researchers would be to be able to click on a reference, where that reference was available online, and be taken automatically to it. It seems obvious, but there were some logistical problems in actually doing this, not the least of which is that there are thousands of publishers, and you had to have some idea of what the URL structure was going to be on each publisher's site. And then, of course, if the publishers changed that URL structure or did anything weird, a lot of those links would break. And this is actually the major problem Crossref is trying to address. It may seem like an easy thing to do. After all, one thing you could do is tell publishers to stop breaking things, right? Update your URLs when you redesign your websites. And that works for a class of broken links, probably the vast majority of broken links. But that only addresses one way in which links typically break: that is, when an organization is too lazy; they go and change the structure of their website, they still control the domain, but they don't create redirects to the old URLs. That's by far the most common source of broken links. But there's another source of broken links that's a lot harder to deal with, and that's when the domain part of the URL changes. That is, when an organization renames itself, or splits in two, or merges, or does something like that, then you might not have control of the URLs anymore. And so you need some mechanism that keeps links working even when these situations occur. And you might think these situations are uncommon, and particularly uncommon for publishers and government organizations, the kinds of entities that participate in Crossref. But there are a lot of reasons these things can change.
It's not just that things get sold: organizations get acquired, transferred, go bankrupt, or simply forget. Increasingly, as you see people using URIs as persistent identifiers, sometimes that means you no longer control a domain that you controlled at one point. And as I say, there are a lot of reasons that things break. A lot of people think, oh well, university links aren't going to break, government links aren't going to break. Actually, government links have among the highest rates of breakage of almost any kind of organization, which is not surprising if you think about it: when a new government comes in, what it wants to do is obliterate the history of the old government, so they rearrange all the websites, and they don't really have an interest in preserving the old structure. Countries change names; all of these things happen. And that's when the domain name might actually change. So these are the things that we're trying to address. One reason, of course, is because links are important for scholars, just as they are on the web at large. But more important, in the case of scholarly articles, they represent the evidence record, the scholarly citation. If a link goes away, we can no longer see the evidence for the claims made in scholarly articles, and in addition, we might not know who is making those claims. So this is really important, particularly in the case of research, and we really had to address it. Now, the mechanism that DOIs use is not a technically difficult problem to solve. It's a more socially difficult problem. What DOIs do is work very much like a card catalog.
If you went to a physical library in the old days and looked in the card catalog, the card catalog would not tell you a book was, you know, on the third floor, on the fifth shelf from the back. It would give you a call number. And the call numbers were mapped to physical locations. That meant that if they rearranged the library or reshelved books, they didn't have to update the card catalog; they could just change the mapping of the call numbers. And DOIs work exactly the same way. That horrible opaque string that Dario mentioned earlier is in fact part of a URL that is a pointer: it says, okay, where does this content physically exist now? So if a publisher changes their website, or gets acquired, or changes a domain name, all they have to do is update those pointers, and all the old links using the DOIs will continue to work. So this allows us to persist links. And this is an important term: persist. We don't claim that they're going to be permanent. Persist is more a synonym for stubbornness. They're stubborn links: we will do our best to keep them updated, we will contact our members, we will do things like fine them and turn off their ability to deposit if they don't update things. But by and large they do, because of course this is a great source of traffic for them, and they take this stuff seriously. They take scholarly references seriously. So that's the way the system works. As I said, the technology isn't the important part. It's a simple redirect; almost every URL shortener on earth uses similar technology. But URL shorteners don't have an organization behind them, and they don't have a membership model that requires the members to adhere to certain conventions and to behave in a particular way. So the technology is very similar, but the organization is really important.
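To make the card-catalog analogy concrete, here is a minimal Python sketch of the indirection being described. The DOI string shown is illustrative only; any registered DOI works the same way. The pure part is building the doi.org resolver URL; the `resolve` function, which follows the resolver's HTTP redirect to the publisher's current page, performs a live network request and is defined but not called.

```python
from urllib.parse import quote
from urllib.request import urlopen

def doi_to_url(doi: str) -> str:
    """Build the doi.org resolver URL: the 'call number' side of the lookup."""
    return "https://doi.org/" + quote(doi, safe="/")

def resolve(doi: str) -> str:
    """Follow the resolver's HTTP redirects to wherever the content lives now.

    Performs a live network request, so it is defined here but not called.
    """
    with urlopen(doi_to_url(doi)) as response:
        return response.geturl()  # the publisher's current landing page

# The DOI below is illustrative; any registered DOI resolves the same way.
print(doi_to_url("10.1371/journal.pone.0115253"))
```

When the publisher moves the landing page, only the pointer behind doi.org changes; every published link built from `doi_to_url` keeps working.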
And again, this organization mediates the problem of having to make multiple bilateral agreements between what really are thousands of publishers. A lot of people, to the degree that they think about scholarly publishers at all, think of the big ones: Elsevier, Springer, Nature, and so on. And of course those are members, but if you actually look at our membership list, it's now approaching 6,000 members, and that's an awful lot of parties to track. So this central linking switchboard really helps. So what does this have to do with Wikipedia? Well, years ago, when I first joined Crossref, one of the things I was interested in doing was understanding a little bit about how non-scholarly content was making use of the scholarly literature: blogs linking to the scholarly literature, social networks linking to the scholarly literature. And at the time, back in about 2007, I looked at Wikipedia. I downloaded a dump of Wikipedia and analyzed it to see how many DOIs seemed to be being referenced, and also to look at the references themselves to see whether, even if they didn't have DOIs, they looked like they were scholarly. And at the time the answer was that there were a few, but not a lot. So I sort of promised myself that I'd come back to the problem later, because it also looked like things were changing and growing. And we did come back to the problem. With somebody from my group, we took the referral logs for the DOI system, that is, the logs that show us every time somebody clicks on a DOI and where it redirects to. And we looked at those referral logs to analyze who was driving traffic to our member sites.
We were interested broadly in understanding how traffic was being driven from non-publisher sites, because we have a pretty good idea of how much the publishers themselves drive. And when we did that, we were really, really surprised, because we learned that Wikipedia is the fifth largest referrer of DOIs to the scholarly literature in the world. Now, I usually use weasel words around this, because the actual referrer logs are a little bit hard to interpret; there's a lot of noise in them. But I'm pretty confident that this is the case, having looked at the logs again. And I actually think this is an understatement, because the other thing that we've done, and I know Dario has done this as well, is look at Wikipedia articles to determine, first, how many DOIs there are in the references and how many of those DOIs are actually linked; not all DOIs that are in the references are linked. And secondly, not all references to the scholarly literature use DOIs. That's largely because DOIs are something you only know about if you're in the trade: you wouldn't necessarily know, if you weren't a researcher, that you should add a DOI, and even if you were a researcher, you might not know how to look up the DOI and link with it. So we think there are probably a lot more references to the scholarly literature than even the traffic suggests. What this translates to: back in 2013, when we last analyzed this data (I tried to get a refresh of the data to present today, but it's still being processed, so I don't have the numbers), it was about 20,000 to 30,000 referrals a day, increasing by about 2,000 a day over the eight-month period we analyzed.
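The kind of referrer analysis described here can be sketched in a few lines. The one-referrer-URL-per-line input format below is an assumption for illustration; real resolver logs carry many more fields and plenty of the noise just mentioned.

```python
from collections import Counter
from urllib.parse import urlparse

def count_referrer_domains(log_lines):
    """Tally which domains drive DOI resolutions, one referrer URL per line."""
    counts = Counter()
    for line in log_lines:
        referrer = line.strip()
        if not referrer:
            continue
        domain = urlparse(referrer).netloc.lower()
        if domain:
            counts[domain] += 1
    return counts

# Invented sample lines standing in for real resolver log entries.
sample = [
    "https://en.wikipedia.org/wiki/Goat",
    "https://de.wikipedia.org/wiki/Hausziege",
    "https://en.wikipedia.org/wiki/Digital_object_identifier",
    "https://twitter.com/some_status",
]
print(count_referrer_domains(sample).most_common(2))
```

Grouping subdomains like `en.wikipedia.org` and `de.wikipedia.org` under one project is a further normalization step the real analysis would need.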
And the top 10 subdomains that we saw traffic coming from were, naturally, the English Wikipedia, but also a lot of the other language editions. So this really made it clear that we wanted to engage more closely with Wikipedia, to A, understand the traffic that was coming from there, and B, see if we could actually help the editors of Wikipedia articles link persistently to the scholarly literature. And I had run across Daniel Mietchen, whom some of you might know, a number of times at a number of conferences, and we'd been talking about this for a long time. He suggested that we put together a little group of what I think he termed Wikimedia ambassadors, because they weren't quite Wikipedians in Residence; they were people who were interested in the scholarly literature. At Wikimania 2014 in London we first met and talked about some of the things we might be able to do. The initial group was Daniel, Maximilian (Max) Klein, and Bertha Howard. But then a few of them got distracted by other things, and ultimately, last year, we got together with Max Klein and Anthony Di Franco to work on a project specifically to take advantage of something that had just been launched on Wikipedia, a live stream of edits, and to see what we could actually do with that stream. So we worked with them, and they prototyped a system they called Cocytus that looked at the stream and tried to pull out DOIs. And a few months later, sometime in March, we announced that we had a live stream of Wikipedia edits, showing in real time whenever anybody cited, or crucially un-cited, a DOI in any of the Wikipedias.
So if you go right now, I hope it's live, to wikipedia.labs.crossref.org, you'll see basically this live update of Wikipedia edits, particularly articles that are citing the scholarly literature, and as I said, sometimes un-citing as well. And this has been a really useful exercise for us, because it's really gotten our members, scholarly publishers, interested in traffic that's being driven to them from outside the scholarly literature. This is a source of a lot of interest, and so they've really been encouraging us to further explore what we can do to encourage people to link using persistent identifiers, to monitor this, and to help build tools that help people use not just DOIs but other kinds of persistent identifiers like PMIDs, and to work with them. And most recently, just this week, Joe Wass, who works for me in Oxford in the UK, decided he was going to take advantage of the fact that the Raspberry Pi Zero had come out, and he managed to snag one of the last ones before they sold out. At the beginning of the week, he posted this little experiment he conducted, where he put together a Raspberry Pi Zero driving a real-time display showing how many DOI citation, or un-citation, events are occurring. So now we proudly have this on the wall in our office in Oxford, next to the refrigerator. If that doesn't show you how important we think this stuff is, I don't know what does. And I'll just note that the thing hanging below it is a collection of telephone adapters that I put together in the old days when I used to travel a lot and had to carry all of those things around with me all the time. If there's a better illustration of the benefits of standards, I don't know what it is. But that's also hanging in our office.
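The core of a Cocytus-style pipeline is spotting DOIs in the text of an edit. This is a simplified sketch, not the actual Cocytus code: a loose regular expression for Crossref-style DOIs applied to a chunk of wikitext, such as an edit diff pulled from the live stream.

```python
import re

# Loose pattern for Crossref-style DOIs: the "10." prefix, a registrant
# code, a slash, then a suffix (stopping at whitespace and wikitext
# delimiters). Real DOI matching has more edge cases than this.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s|}<>"\']+')

def extract_dois(wikitext: str) -> list:
    """Pull DOI-like strings out of a chunk of wikitext, e.g. an edit diff."""
    dois = DOI_PATTERN.findall(wikitext)
    # Trailing punctuation is usually sentence debris, not part of the DOI.
    return [d.rstrip('.,;') for d in dois]

# An invented cite-journal snippet of the kind an edit diff might contain.
edit = ("{{cite journal |doi=10.1371/journal.pbio.1002190 "
        "|title=Example |journal=PLOS Biology}}")
print(extract_dois(edit))
```

Diffing the DOI sets of the old and new revision of a page then tells you whether an edit cited or un-cited something.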
So ultimately, this gets back to an overall goal that we have at Crossref, which is to understand how non-scholarly sources are engaging with the scholarly literature. We think this is important for us to monitor. Increasingly, it's important for researchers to know, for funders to know, for publishers to know that the formally published research is having some sort of an effect, that it is being accessed by civilians out there, not just scholars. And so ultimately, our goal is to feed this into a project that we'll be launching in the middle of next year called the DOI Event Tracker, which is a general-purpose framework where we want to gather information about the usage of DOIs in all sorts of different sources, whether Wikipedia or Twitter or blog posts or social networking sites, so that we can build a pool of data that can be used for, amongst other things, building awareness applications, perhaps building metrics, and so on. And again, we see this as something we're uniquely positioned to do, because we can basically create a mechanism for collecting this information for our 5,000 members from however many different social media or other platforms might be engaging with the literature. So we're building this tool that goes out and collects data, stores it, and makes it open. And this is really critical: as we collect this information, we want to make sure that people can actually use it. We don't want to compete with the organizations doing value-added services on top of this, the general category of organizations building altmetrics: reports, analyses, and things like that. But what we want to do is make sure that these things don't turn into the equivalent of the Thomson Reuters of our day. That is, Thomson Reuters is a big organization.
Well, there are two big organizations that are basically the only organizations with an overall view of traditional citations in the literature, and I'm sure Dario is going to talk about that in his talk. But we want to make sure that if we're collecting new sources of information about usage, that data is open, comparable, auditable, and portable. That is, we want to make sure that this raw data belongs to the community and that everybody can make use of it. So here are the next things we're looking at working on, and part of my goal in visiting Dario is to see whether we can collaborate on some of them. We're interested in further analysis of the referrals by subject category. We have a strong suspicion that the DOI referrals we're seeing, people clicking on DOIs and following them to the literature, occur in specific subject categories, possibly the biomedical literature. We want to start searching for directly linked references: cases where people linked to something on a publisher site without using the DOI, putting the publisher's URL to the article in instead, and see if we can map that back to the DOI and collect that kind of information. And we want to search for unlinked scholarly references, and not just articles but monographs, and all the orthogonal related literature, things like patents and standards, that are closely tied to the scholarly literature and often interact with it. We're very interested in working with people to improve the citation tools being built for the Wikimedia and Wikipedia platforms. For instance, there's a lot of interest in collecting some of this information and making sure that references are pulled into Wikidata, and we're happy to feed Wikidata so that it has those references, so that not every reference is an individual string.
We think that by doing this you will probably get a lot of benefits. For instance, by using some of the Crossref metadata, you'll be able to do things like flag references to open access literature, or flag references to literature that has been updated, that is, for instance, corrected, or in extremis retracted or withdrawn. So we think there are all sorts of tools that can be built off of this that will make the reference and citation process in Wikipedia more useful and more dynamic. And I strongly suspect Dario will be talking about some of those projects. So basically that really is what we've been doing with Wikipedia. It's part of an overall project of ours to look at how things interact with the scholarly literature. Wikipedia is by far the largest source of referrals and contact with the scholarly literature that we've seen; I'd say it ranks right under the top search aggregators. If you're interested in tracking the stuff that we do, look at blog.crossref.org or labs.crossref.org, where we post all of our experiments and talk about them, and of course you can contact us directly. Joe Wass is the person who's been working most directly on a lot of this; he built that tool. But a lot of it also depends on another member of my team, who Dario and others are planning on working with to mine some of these references in Wikipedia. And of course you can contact me if you're interested in any of this. I'll be working pretty closely with Dario. So anyway, that's it; I think I'm done.

Thanks, Geoff. This is the best possible introduction to my talk. Brenda, can you help me switch to the Google Docs? What I want to say is that Geoff covered one issue that is really critical, the disappearance of links: link rot. It's a huge, huge issue if you want this network of references to remain persistent and available.
I'm going to cover another aspect of the issues we have with links and sourcing, and that will be the main focus of my talk today. Okay, it's working. All right, so briefly, what I do: I run research at the Wikimedia Foundation, but today I'm not going to talk about what my team is doing. I'm talking about something I'm passionate about and have been collaborating on for a while. I'm really interested in the question of how we use open knowledge and collaboratively created reference materials, like those you find on Wikipedia, as the entry point towards the scholarly literature. Cameron Neylon is the one who came up with this brilliant wording of Wikipedia as the front matter to all research, and he also ran a great workshop at Wikimania in 2014. A few years before, the Chronicle of Higher Education called Wikipedia the highest layer without formal vetting, one that represents the ideal bridge between the validated and the unvalidated web. And I found this a pretty strong metaphor for what Wikipedia represents. Typically, when people start talking about access to scientific knowledge, the audience expects a rant about paywalls, about how broken the publishing model is. Today, instead, I want to talk about technology. So you'll hear me talk about unique identifiers, about structured data, about bibliographic metadata and Wikidata, and you'll hear me talk about goats. The reason I want to talk about goats today is that I want to give you an example of what I like to call the disappearance of provenance. That is the main topic I want to talk about, along with the ways we have to counter this general tendency that we see today on the web. So my goal today is threefold. I want to try to persuade you that the algorithms used by search engines are undermining linking and sourcing.
These are two of the fundamental premises of the open web, and ultimately of Wikipedia, which builds itself on the notions of sourcing and verifiability. I want to try to persuade you that Wikidata is a solution to this problem. And finally, I want to introduce you to a little-known project led by the community, called WikiProject Source MetaData on Wikidata, which is actually trying to address this issue. So, back to goats. As I'm sure you all know, Wikipedia has an amazingly well-sourced article about goats: their anatomy, their diet, their evolutionary history, their presence in fiction and pop culture, with dozens of references, in up to 135 languages. We really have plenty of very well-sourced information about goats on Wikipedia. And somewhere in the middle of the English Wikipedia article, you will find a very inconspicuous sentence about the average lifespan of a goat, with footnotes and references that you can look up to verify that statement. Now, this is what a popular search engine gives you when you search for the average goat lifespan: 15 to 18 years, period. Let's take a look at this. This information seems to have been extracted from Wikipedia, or at least from the source that is cited by Wikipedia, but it's presented with no provenance information whatsoever. And you can try this exercise with plenty of statements you can think of that have a quick answer of this kind. And this is not an accident. This is something that search engines are heavily investing in. I don't know if any of you have heard of the Google Knowledge Vault; a show of hands, yes, a bunch of you. Okay. So, the Knowledge Vault is one of the most ambitious projects Google is currently working on in terms of going beyond Freebase as its semantic engine. The idea is that Google can crunch the vast catalog of documents in its index and generate what they call confident facts.
And confident facts are basically triples, statements like the ones we just saw. They're extracted from all the sources, and they're given a level of confidence as a function of how many sources back each statement. So, in other words, references are just used as a signal to determine the confidence that Google has in the truth of these statements. They're not necessarily represented in the output of the algorithm; they're instrumental to generating the statements. And I think this is a troubling direction that we need to be aware of, one that raises some interesting questions. Obviously, Google is not doing this for fun. This is basically the dream of the linked data vision, right? Where you can have quick answers that you can generate just by looking up a pool of connected facts. And with most people shifting to mobile, this need for quick answers is becoming even more pressing. So, this is an example of what you see when you ask Siri: how long do greyhounds live? You get an answer, in this case from Wolfram Alpha, but it's pretty much the same story. The answer is about 11 years, no source whatsoever. And interestingly, if you go and check what Wolfram Alpha is doing, they have a big disclaimer telling you that you shouldn't really expect to pinpoint any single source for that statement. They really warn you that that is not what you should be doing. And the reason is the same: they have a bunch of sources they're using, and they want you to trust the answer engine as the authority for that statement. Provenance-free information, that's what users are looking for. They only care about the actual statement, not the source backing it. So, David Weinberger made a fantastically compelling point back in 2012, when he called out the behavior of these websites that link to themselves as opposed to linking to external sources. He called these sites a stopping point in the ecology of information.
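The mechanics being described, sources used only as a confidence signal and then discarded, can be sketched in a few lines. This is a toy model, not the Knowledge Vault's actual scoring, and the facts and source names are invented for illustration.

```python
from collections import defaultdict

class FactPool:
    """Toy 'confident facts' store: triples scored by corroborating sources."""

    def __init__(self):
        self.sources = defaultdict(set)  # (subject, predicate, object) -> sources

    def assert_fact(self, subject, predicate, obj, source):
        self.sources[(subject, predicate, obj)].add(source)

    def confidence(self, subject, predicate, obj):
        # A simple monotone score in [0, 1): more corroboration, more confidence.
        n = len(self.sources[(subject, predicate, obj)])
        return n / (n + 1)

pool = FactPool()
pool.assert_fact("goat", "average_lifespan", "15-18 years", "en.wikipedia.org")
pool.assert_fact("goat", "average_lifespan", "15-18 years", "example-farm-site.org")
print(pool.confidence("goat", "average_lifespan", "15-18 years"))
```

Note that the served answer is just the bare triple and its score: the source set exists inside the system but never reaches the user, which is exactly the disappearance of provenance.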
And he basically argued that by taking the links out of content, these websites turn themselves into the authority for this information, as opposed to allowing people to verify these knowledge claims in their original context. And in a recent conversation with him, he actually reminded me that there's a word for this, and a beautiful precedent in history for this notion of provenance-free information. And that's the almanac. The almanac is basically the best example of an answer engine that doesn't give you any context about the source. And humanity in 2015 is recreating the almanac, thanks to algorithms. It's an amazing achievement of our civilization. Now, we do have a project that is an answer engine that doesn't work like an almanac, one that acts as a provenance-preserving answer engine. And that's Wikipedia. Wikipedia is all about verifiability. It's not about truth; it's about providing access to reliable sources. And its reputation does come from the fact that you can look up and verify this information by yourself. That's what makes Wikipedia such an outstanding public service. That's why Wikipedia is not an authority in itself; it's an entry point towards authoritative sources. And you might think that we're doing a good job, given this is basically our official mission, but things are far from being perfect. What you see here is the breakdown of statements we have on Wikidata as a function of the sources they cite. As of last month, up to 80% of these statements either use Wikipedia as their main source or have no source whatsoever. So, only roughly 20% of the statements in Wikidata have an external source transparently represented as backing the statement. This is a giant caveat on the quality of the data that Wikidata is generating. So, how do we start tackling this problem?
I think we have an outstanding opportunity at Wikimedia to address this issue by building a human-annotated, collaborative repository of all citation and source data, and trying to bridge the gap between the information we present on our projects and the sources that back it. In a perfect world, where all information can be represented as a connected pool of statements, metadata about the sources, and the sources themselves, should be part of this database. There's no reason why they should not be. And there have been countless efforts to try to build a comprehensive repository of open bibliographic data. Crossref represents probably the most accomplished effort in that direction. But Crossref only represents some of this data, and it makes only some of this data publicly available to the general public. And Crossref data, for example, is not integrated into a knowledge base. It's very hard to go from a bibliographic reference, and maybe the author information, to the topic of that reference, or to the institutions, to some extent. Basically, Crossref data is a fantastic but insular source of information about sources. How do we connect it to the knowledge bases that we're so good at building here at Wikimedia? Within Wikimedia itself, there have been at least ten years of attempts that I'm aware of at building a repository of bibliographic information for all the citations that we have in Wikipedia. These efforts go back to 2005. None succeeded, and they largely predate Wikidata. But finally, we're here. We have Wikidata, and I think we have a project that has the vision, the technology, a community that geeks out about metadata and structured data, the ability to operate at scale, the right licensing model, because we're talking about mostly CC0, non-copyrightable information, and finally, and importantly, independence.
Because if we want to have a service that is not a spin-off of a publisher, it has to be a central repository that is not subject to the commercial interests of third parties. So, basically, we have all the conditions today. Wikidata has all we need to start building this human-curated repository of all citations, starting from the citations used in Wikipedia. And this sum of all human citations could then feed into a large ecosystem of consumer services: integration with Wikipedia and Commons; the automated services Geoff mentioned before that measure the impact of citations; and also, of course, all of the providers of scholarly metadata and identifiers. So, how do we start building this? Well, it turns out there's a group of people in the librarian, scholarly, and Wikimedia communities, including myself, who got together and started figuring out how to build this. So, meet WikiProject Source MetaData, which you can look up on Wikidata. The first goal of this group was to determine a data model for storing source metadata in Wikidata as items. The idea is: what are the properties that we need to represent the core set of information we want to store for any given citation, like journal article properties, journal properties, book properties, author-related properties, and so on. And I think we did a pretty good job. We now have a solid schema that has been discussed and vetted, and in Wikidata itself, plenty of publications are already stored using this schema. This is an example of a journal article published in PLOS ONE, one of the journals of the Public Library of Science. This is the title; you can see the license, the language in which it's written, the authors, the publication date, the main subject, and even a picture of what this paper is about. All of this is possible today, and we already have quite a few publications in Wikidata using this schema. But what are the actual benefits of this approach?
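To make the schema idea concrete, here is a minimal sketch of turning a flat citation record, of the kind one might extract from the Crossref API, into Wikidata-style (property, value) claims. The property IDs used (P356 for DOI, P1476 for title, P577 for publication date, P1433 for published in, P2093 for author name string) reflect my understanding of the properties in common use; treat them, the sample record, and the journal QID as assumptions, not as the project's authoritative mapping.

```python
# Sketch: map a flat citation record to Wikidata-style claims.
# Property IDs below are assumptions based on common usage
# (P356 = DOI, P1476 = title, P577 = publication date,
#  P1433 = published in, P2093 = author name string).

def to_wikidata_claims(record, journal_qid=None):
    claims = [
        ("P356", record["doi"]),
        ("P1476", record["title"]),
        ("P577", record["year"]),
    ]
    # Link the journal as an item only if we already know its QID;
    # don't invent one.
    if journal_qid:
        claims.append(("P1433", journal_qid))
    # Authors start life as plain name strings until they are
    # disambiguated and matched to author items.
    claims.extend(("P2093", name) for name in record["authors"])
    return claims

record = {  # invented record for illustration, not a real DOI
    "doi": "10.1234/example.5678",
    "title": "An Example Article",
    "year": 2015,
    "authors": ["Ada Lovelace"],
}
claims = to_wikidata_claims(record, journal_qid="Q564954")  # assumed QID
```

In practice the author name strings would later be upgraded to item links (P50) once the disambiguation discussed below has matched them to author items.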
At the end of the day, you could say, you know, we have all these references in Wikipedia already, and they're structured to some extent. This is what a template looks like in English Wikipedia, a cite journal template. Now, the trouble with this approach is that right now, all of these citations are buried in the body of an article. There's no way you can interact with them as items in a database: you need to parse content. It's a huge pain to try and make any sense of this. And so people, if they want to cite something, still need to add a reference string to the actual article. What if we could instead cite by reference, as you can do in most reference managers today? What if we could pull the reference from a database like Wikidata, cite it by referring to the item that exists in Wikidata, and generate the appropriate reference string? This is possible today on many wikis by using templates that allow you to pull information from Wikidata, and it's going to be one of the first benefits of using Wikidata as a source for this data. The next one is a bit of a geeky, technical slide, but it's one that I feel strongly about. Wikidata can also be the place for storing all the mappings between identifiers. If you've never heard of identifier mapping or authority control, you can safely skip this slide. But if you have, and if you've played with something like Magnus Manske's Mix'n'match tool, for example, the tool that allows you to cross-reference Wikidata items with external catalogs, you realize how important it is to have a place where you can represent the fact that a given scholarly article, identified on Wikidata via its own QID, is the same as an article that has a DOI, a PubMed ID, et cetera. The same goes for authors: authority control via ORCID, via VIAF, via Google Scholar IDs, and you name it. So, Wikidata can become the place that holds all the mappings that describe the same objects.
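The cite-by-reference idea above can be sketched mechanically: given an item's structured metadata, the reference string is rendered on demand rather than hard-coded in the article text. The QID, the item store, and the formatting style below are all invented for illustration; this is not any wiki's actual template output.

```python
# Sketch of "cite by reference": an article cites an item by ID,
# and the reference string is generated from structured metadata,
# the way a template could pull it from Wikidata by QID.

def render_reference(item):
    """Render a citation string from a structured source item."""
    authors = "; ".join(item["authors"])
    return (f'{authors} ({item["year"]}). "{item["title"]}". '
            f'{item["journal"]}. doi:{item["doi"]}')

items = {  # stand-in for the item store; QID is made up
    "Q00000001": {
        "authors": ["Ada Lovelace"],
        "year": 2015,
        "title": "An Example Article",
        "journal": "PLOS ONE",
        "doi": "10.1234/example.5678",
    },
}

# The string is generated at render time, not stored in the article.
ref = render_reference(items["Q00000001"])
```

The payoff is that a correction to the item (a fixed author spelling, say) propagates to every article citing it, instead of requiring edits to each copy of the reference string.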
There's also this growing interest, which Geoff mentioned, in measuring the impact of citations beyond scholarly citations. This is an example from a popular service called Altmetric.com, and what they're doing is similar to what the DOI Event Tracker is doing: they're basically monitoring the usage of scholarly citations in Wikipedia, mostly to give credit to authors and funding bodies, to show: hey, your work is actually being cited, not just in journals, but also in the most popular online encyclopedia, one that people are reading on a daily basis. To me, possibly the most interesting implication of this project is the ability to cross-reference sources, as I said, with the vast body of knowledge that we have in Wikidata. So, if you're using the Crossref API, you'll be able to know something about a specific DOI and the metadata directly associated with it. By caching this data in Wikidata and cross-referencing it with all the other structured data we have in Wikidata, you can start going from a paper to its main subject, from the subject, in this case, to the taxon that is referenced in that paper, from the paper to the author, from the author to the affiliation, from the affiliation to the location, and you name it, right? This is basically the rabbit hole applied to structured data. Now, the next two slides are a bit of a moonshot, so this is not within the scope of what we expect to happen anytime soon. But once you have source metadata in Wikidata, you can also start thinking about annotating the sources themselves. Geoff mentioned licenses: knowing whether a specific paper is openly licensed or not is something that got quite a lot of attention recently. That's something the Crossref API can provide, but you could imagine adding other types of properties to these papers.
For example, retraction status. You can start representing links between different sources, for example, A citing B. You can even go as far as representing the semantic type of these relations. There's a proposal for what is called a citation ontology that has been discussed for a couple of years, and given that Wikidata allows you to define arbitrary relations, you could imagine properties that allow you to describe that source A extends source B, or that source A uses methods or data already used in B, or even express disagreement and conflicts between sources. So, stuff that, as of today, is really, really hard to extract just by looking at the citation graph of the literature. And finally, once we have all of this data stored in Wikidata, easily queryable and analyzable, you can start answering questions which, unfortunately, today are virtually impossible to answer at the push of a button. Give me all the publications in pharmacology from the '90s that have been retracted. Give me all the facts that we have on Wikidata that are backed by works of physicists who graduated from given universities in the '80s. Give me all the statements that are supported by articles published in the New York Times, et cetera, et cetera. All this work that is really fundamental for the validation of this knowledge as a function of its sources will suddenly become possible if we have all this data queryable in Wikidata. So, I'm going to close here and say that, in sum, we have the ability to start building an answer engine that is not an almanac, an answer engine that is provenance-preserving, and we can do this by using existing technology, basically Wikidata and the properties and data models that we already have on it. The next steps: we're talking to a bunch of people to run a pilot to try and populate Wikidata, or potentially a sandboxed version of it.
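The kind of query described above can be sketched against a tiny typed citation graph. The relation names here (extends, disputes) are illustrative, loosely inspired by citation-ontology proposals, and are not actual Wikidata properties; the papers and fields are invented.

```python
# Tiny in-memory sketch of typed relations between sources and the
# queries they enable. Relation names are illustrative, not real
# Wikidata properties; all data is invented.

papers = {
    "A": {"field": "pharmacology", "year": 1994, "retracted": True},
    "B": {"field": "pharmacology", "year": 2001, "retracted": False},
    "C": {"field": "physics",      "year": 1997, "retracted": True},
}
relations = [  # (source, relation, target)
    ("B", "extends", "A"),
    ("B", "disputes", "C"),
]

def retracted_in(field, year_from, year_to):
    """E.g. 'all retracted pharmacology publications from the 90s'."""
    return sorted(
        pid for pid, p in papers.items()
        if p["field"] == field
        and year_from <= p["year"] <= year_to
        and p["retracted"]
    )

def related(pid, relation):
    """Targets that `pid` relates to via the given typed relation."""
    return sorted(t for s, r, t in relations if s == pid and r == relation)
```

With the data in a real triple store this would be a SPARQL query rather than a Python loop, but the shape of the question is the same: filter items by their properties, then follow typed edges.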
James Hare set up an instance similar to Wikidata called Librarybase that we can use for testing and prototyping, to see if we can get a community to import and curate this data. We need to start designing strategies for automatically importing some of this data, for example from the Crossref API, and linking it to the corresponding statements. We need to understand how to refine the data model and how to represent properties; some properties are not the correct ones. There's a big, big issue of entity disambiguation: once you start having all these authors called John Smith, with multiple IDs, the question becomes how you disambiguate them. And finally, most importantly, how can you design a system that allows you to ingest this data while preserving the human curation layer? That's probably the biggest challenge. It's easy to import all this data automatically; the 70 million DOIs you guys have will be relatively easy to ingest. It's not an impossible task to ingest into Wikidata. The question is: how do we do this in a way that preserves the human curation layer? And with that, I conclude here. Thanks for your attention, and I'd be happy to take any questions. Geoff, you can join me here, and we have a relaying of questions from IRC.

Hey, Dario. Yes? This is Jake. I might have missed it. Is this question time? It is. Hi, Jake.

Great presentation, both of you. Awesome stuff. I'm curious. Obviously, the vision here is to capture all citations ever, but our shorter-term, or more proximate, issue is just capturing all the citations that are in Wikipedia. I was wondering if you could talk about the relationship between those two stages?

Yeah, that's a good question. And the short answer is: I don't know how this is going to play out. What I know is that we want to start small. Like I said, we cannot just start by ingesting all this data.
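The John Smith disambiguation problem mentioned above is usually attacked with heuristics before any human review. A minimal sketch of one common approach: trust a shared ORCID first, and fall back to coauthor overlap otherwise. The threshold, field names, and all records below are assumptions for illustration, not a real matching algorithm from the project.

```python
# Heuristic sketch for "are these two author records the same
# person?". A shared ORCID settles it; otherwise fall back to
# coauthor overlap. Threshold and records are invented.

def same_author(a, b, min_shared_coauthors=2):
    if a.get("orcid") and b.get("orcid"):
        # Persistent identifiers are the strongest signal.
        return a["orcid"] == b["orcid"]
    # Fallback: enough shared coauthors suggests the same person.
    shared = set(a["coauthors"]) & set(b["coauthors"])
    return len(shared) >= min_shared_coauthors

rec1 = {"name": "John Smith", "orcid": None,
        "coauthors": {"A. Jones", "B. Chen", "C. Okafor"}}
rec2 = {"name": "J. Smith", "orcid": None,
        "coauthors": {"A. Jones", "B. Chen"}}
rec3 = {"name": "John Smith", "orcid": "0000-0000-0000-0001",  # made up
        "coauthors": {"D. Patel"}}
```

Real systems layer many more signals (affiliation, subject area, publication years), and the point made in the talk stands: heuristics propose, but the human curation layer has to be able to confirm or overturn the match.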
Again, 70 million statements is not an impossible size compared to what we could store in Wikidata, but that probably wouldn't be desirable. I think the first step, after we sandbox and test the data model, like I said, is to focus on one community. We have people in the chemistry, biology, and genetics communities very interested in exploring this idea of a pilot, and we'd basically start from there. And yes, I guess the first ambitious goal is to have all citations from Wikimedia projects stored in Wikidata. And then the next step is going to be to put Scopus and Web of Science out of business. That's going to take a bit longer.

If there's not anyone else asking questions, I might. Absolutely. So, Alex Stinson, on the Wikipedia Library team. One of the things that brought us into those conversations is looking at how libraries make recommendations to researchers doing work in their libraries and using their databases. And what we have with Wikipedia is hand-curated citations that are grouped by topic areas, based on how categories and WikiProjects work on Wikipedia. So we can kind of tell: if journal X shows up in a certain percentage of these 10,000 articles in this topic, this might be a journal to start your research with, for X type of novel, or X type of biography, or whatever area of history. So the human-curated element of this is really fascinating to me, because we could take the human-curated citations from Wikipedia and build really robust research recommendations into the library infrastructure. So I'd like to hear you talk a little bit more about that relationship between making recommendations for researchers and the human element that's so powerful in Wikipedia. Geoff, any thoughts on this?
I mean, there was something that I was going to say, which is that one of the reasons I'm interested, that I think we're interested, in getting a better picture of where there are lots of scholarly references and where there aren't, is that I have a suspicion that if we're seeing lots of references being followed from certain articles, it may be the case that a lot of Wikipedia articles are serving a purpose that has traditionally been performed by what are called review articles. I've got the feeling that Wikipedia might actually be starting to become an entry point for people who are doing research in a field. And it sounds to me like that's related to what you're talking about. If you could in fact flag those articles and say, these actually seem to be really good entry points for certain topics, and so on and so forth, I think that might be quite useful as far as recommendations go.

Yeah, the other thought I had was related to the maintenance, the quality, of this data. In some cases we have author names that are misrepresented, and I guess one of the assets we have in Wikipedia, or in Wikimedia in general, is that when we cache this information we can also put it under revision control. So if a human notices an incorrect spelling of an author name, or in some cases titles that are not correctly represented, or a mismatch with the official record on the publisher's site, I guess one way of thinking of the human curation layer is that it will also address that issue of potential errors in the data.
Right, and Crossref has been interested in this area a lot too, because one of the things that we find with things like ORCID and Crossref, and things like arXiv and even PMC, is that none of them are set up so that other people can easily make assertions or corrections about the metadata. In our view, we would like to encourage that, and this is part of what we're looking at with the DOI event tracking system: to allow people to put information in there about DOIs that maybe hasn't been put in by the publisher. So, for instance, if a funder knows that a certain DOI identifies an article about something that they had funded, they might want to be able to make that assertion; or the researcher might want to be able to make that assertion. As far as versions go, we actually think that collecting all of those statements together might be useful even if they contradict each other. Of course, if they don't contradict each other, that gives you a bit more confidence that the statement might actually be correct; but if they do contradict each other, that's also useful information to know. So we're interested in storing these assertions, or claims as we call them, so that people can harvest them and see claims from different parties about the same identifiers.

That's actually a great point. The idea of having some of this data represented in Wikidata means that we also have talk pages, right? So you now have an entry point that is collaboratively editable, where you can have a discussion about a specific source; it's centralized, and it's openly licensed. So it's a better way of doing what you just described. I have a question from IRC. One is: 70 million, plus probably an average of three authors, so 200 million items, is that a rough estimate?
Yeah, so, when we talked about an estimate of how many items and statements would have to be created on Wikidata, if you start thinking of the item itself, the authors, the affiliations, the journal, we're talking about something around that order of magnitude. So, again, it's not unthinkable given the current size of Wikidata, though it would be by far larger than its current contents.

Right, I mean, the 70-odd million, I think it's 77 million. Crossref has 77 million DOIs assigned to research objects, the vast majority of which are articles, and we have bibliographic and non-bibliographic information for those. The thing that you have to be considering, in your case, is of course the number of relationships between those 77 million items, and that's going to be many hundreds of millions of relationships. And that's not to mention the relationships between those things that have formal references and the things that don't, because there's an awful lot of data and other material out there that hasn't traditionally been included in scholarly references, which, of course, I think both of us are interested in seeing referred to more robustly.

But scalability, thankfully, is not a giant issue at Wikimedia, so we'll solve that problem once we get there. So, I have a question which deviates a little bit from the previous conversation. Traditionally, the DOI documents, or gives numbers to, actual scholarly publications. But as publishing moves from the traditional format into more online publication, including actual Wikipedia content, we still have a lot of news articles that we use as references in articles. What's the future plan for the DOI? Will it start covering more of those publications, to stabilize them and eliminate the role of URLs?
Right. So, the DOI, I mean, this is a point that I keep making: traditionally, the Crossref DOI has been used for identifying scholarly articles, and particularly for citing scholarly articles. And that's an important thing to note, because it means the identifier works at a different level than it would if it were being used, for instance, as a serial number or as a supply chain management identifier. Let me give you an example. The EPUB, PDF, and HTML versions of an article are intellectually equivalent: you don't want different identifiers for them for the purposes of citation, but you might want different identifiers for them if you were, for instance, trying to measure how many units you were selling of each format, or something like that. So, particularly in the case of DOIs, we're using them for referring to intellectually distinct objects, and that poses some interesting problems. A lot of people say, well, there's a whole new category of research objects that are different because they're more dynamic, because they're changing all the time, with wikis being a classic example of content that is effectively in flux, except that you still kind of need the ability to refer to a particular version of it. The expectation that you have of a citation is slightly different from the expectation that most people have of a link. People are perfectly content to have a link to something and expect that thing perhaps to change over time, but if you have a citation, you want to see what the author saw when they cited it, maybe fifty or a hundred years ago. So you'll have some different behaviors there.
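The distinction just described, one citable work versus its many format-level copies, can be sketched as a small data structure: a single DOI identifies the intellectually distinct work for citation, while per-format instances could carry their own identifiers for purposes like sales tracking. All names, IDs, and fields below are invented for illustration.

```python
# Sketch of the granularity point above: citation targets the
# work (one DOI); formats (EPUB/PDF/HTML) could carry separate
# identifiers for supply-chain purposes. All IDs are invented.

class Work:
    def __init__(self, doi, title, formats):
        self.doi = doi          # one identifier for citation purposes
        self.title = title
        self.formats = formats  # format name -> format-level identifier

    def citation_id(self):
        """Every format of the work resolves to the same citable identity."""
        return self.doi

work = Work(
    doi="10.1234/example.5678",
    title="An Example Article",
    formats={"epub": "SKU-001", "pdf": "SKU-002", "html": "SKU-003"},
)
```

The design choice mirrors the talk: which identifier you cite depends on what question you are asking, intellectual equivalence for citation versus unit-level distinction for measurement.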
So, I hope that I'm getting to your question. We don't see a fundamental problem with using DOIs with other kinds of research objects: things like datasets, media, highly versioned and dynamic content. This is a place where we're evolving the DOI right now, and we're working very closely with our sister organization, DataCite, who in particular is adapting the system to work better with dynamic datasets, and so on and so forth. But the thing we're always keeping in mind is that the goal, both of DataCite and of Crossref, is to preserve this notion that we're citing something. Whether it's a dataset or media or an article or some new kind of communication, the expectation is that whoever cites it will see what the author saw, or as close to what the author saw as possible. That's a really important thing to keep in mind. Does that answer it?

I have a question for you: how about using Wikidata queries to identify uncited or poorly cited statements, statements that aren't cited at all currently in Wikipedia itself?

So the question is how easy it is today to identify statements that are poorly referenced or sourced in Wikipedia. The answer is that it's very hard, because Wikidata stores a tiny fraction of the statements that we have in Wikipedia. Wikidata stores, on the one hand, statements that just don't exist anywhere in Wikipedia, but it also stores, I guess, a subset of the statements we have in Wikipedia, and it's very hard to find, via Wikidata, all the statements in Wikipedia that would need sources. But in a way, we have other datasets that can be used to identify potential gaps: looking at Wikipedia itself, at Wikidata, and at external knowledge bases, you can more easily identify gaps and maybe direct the attention of this effort where it's needed on sourcing. I have another question: can you confirm that Wikipedia is being used as an
entry point, at least among teachers and university folks I know? Oh, he can't confirm; this person can confirm: they quickly glance at the wiki and look for papers. Maybe that wasn't a question, sorry, it was a statement, I suppose. I mean, one of the things I'm interested in doing, and that we really haven't done, would be to see if there are particular articles that send an awful lot of traffic, or a disproportionate amount of traffic, our way; we just haven't done that yet. Looking at particular fields: we have a very rough approximation of how to categorize the journals that DOIs are in, whether they're biomedical, humanities, social sciences, things like that. And beyond that, knowing specifically if there are some super-articles that are sending a lot of references would be interesting.

That also reminds me that we have a project in the pipeline to segment readers as a function of how they consume Wikipedia. So I'm thinking that some of the data we're going to get from that segmentation effort, looking at whether people are looking for a quick fact, for an overview of a topic, or for in-depth information on a topic, might also help us identify articles that belong to different types in terms of who was reading them.

And this is probably a good time to bring up that we're very aware, and Dario in particular, who has been working with us, is very aware, of the creepy factor of potentially invading people's privacy here. This is actually a big concern that's developing in the scholarly communication space, period. Libraries have been really, really careful about preserving the privacy of their patrons in meatspace, and they haven't really done the same thing in the digital space, but they're beginning to realize they haven't done that, and they're beginning to get, you know, justifiably worried
about it. So, one of the things that we hit when we were working on this project with Wikipedia was, of course, that when you switched to HTTPS, all of a sudden all your traffic went dark, right? And so we worked very closely to figure out, first of all, how to make sure that our systems supported HTTPS, and then also how we could in fact still get some information, without personally identifying information and so on and so forth, so that we could actually continue to see how much traffic was being driven from Wikipedia. So this is a big, big issue. You know, we sit there and go, oh, we want to know all these things about the users and stuff like that, and that actually makes people very nervous as well. So I just wanted to, as the voice of the IRC, say thank you very much for the presentations. Right, thanks everybody, I think that's a wrap. See you on the internets.
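Returning to the earlier IRC question about identifying poorly cited statements: in a world where every statement carries machine-readable references, the check becomes mechanical. A minimal sketch follows; the statement shape, the items, and the "imported from" marker for circular references are all invented for illustration, not Wikidata's actual data model.

```python
# Sketch: flag statements with no reference, or only circular
# references back to Wikipedia, the way one might query for
# poorly sourced claims. Statement shape is invented.

statements = [
    {"item": "Q1", "property": "P50",  "refs": ["10.1234/example.5678"]},
    {"item": "Q2", "property": "P577", "refs": []},
    {"item": "Q3", "property": "P356", "refs": ["imported from Wikipedia"]},
]

def is_external(ref):
    """Circular references back to Wikipedia don't count as sourcing."""
    return not ref.startswith("imported from")

def unsourced(stmts):
    """Items whose statement has no external reference at all."""
    return [s["item"] for s in stmts
            if not any(is_external(r) for r in s["refs"])]
```

This is exactly the 80/20 breakdown shown at the start of the talk, expressed as a filter: statements with only Wikipedia-derived or absent references are the ones the curation effort needs to reach.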