Okay. Good morning everybody. My name is Dario. I run research at the Wikimedia Foundation, and I'd like to welcome you to this brown bag talk about citations in Wikipedia. I'm thrilled to have an extraordinary guest today, Geoff Bilder; he's Director of Strategic Initiatives at Crossref. For those of you who don't know Crossref, I'm sure Geoff is going to give you plenty of context on this, but if you've ever seen that meaningless sequence of characters you click on when you want to access the actual paper, you can blame this guy. So we're going to talk about the interface between Wikimedia and Crossref, and talk about citations and data. We're going to have two presentations.

A few logistics: there's a live stream you can follow on YouTube, and for live conversations you can join the Wikimedia IRC channel, where Abby will be moderating. So, two presentations, and we'll have a discussion at the end. And with that, I'll hand it over to Geoff Bilder.

Okay, thanks, Dario. So I should probably start by explaining a little bit about those character strings that Dario mentioned, and probably also a little bit about myself. Dario mentioned I'm the Director of Strategic Initiatives at Crossref, and that's really just a fancy way of saying I do new stuff; I'm effectively the R&D director there. A bunch of things have come out of that work at Crossref, some of which you might be familiar with. Some of them are things like CrossCheck, which is a plagiarism detection system that serves the scholarly publishing industry; FundRef, or funder identifiers, which are used heavily for identifying research that comes out of particular funding agencies; and the thing you may be seeing a lot more of is ORCID, which actually also came out of Crossref and my group, and then spun out as a separate organization. So these are the kinds of things we work on. Largely, if you were to characterize the stuff that we work on, it's infrastructure, scholarly infrastructure. And ideally, one of the things that is characteristic of infrastructure is that when it's working, you don't notice it. It's only when it breaks that you actually notice it. It's only when things are incompatible that you notice it. So with DOIs, to the degree that you don't notice them, consider that a success. Having said that, that leaves us in a kind of difficult situation, which is that we often have to explain a little bit about what they do, because it's not immediately obvious.

So, some background on Crossref: we're a membership organization of roughly 5,000 members, we're a non-profit, and generally our members are scholarly publishers, but we actually define publishers with a small p. That is, it's not necessary that they be professional publishers. We've got lots of members who, for instance, have publishing operations but don't consider themselves to be publishers: places like the World Bank, the OECD, the International Monetary Fund, and the like. We don't discriminate on business models, we support all content types, all disciplines, and we also have a lot of library affiliates and similar organizations that have joined Crossref as well. And what we do, and our major claim to fame, is the DOI. DOI stands for Digital Object Identifier. It's the identifier that's digital; it's not necessarily the object being identified that needs to be digital.
And these things, as Dario mentioned, are kind of weird, opaque strings, but they have a particular function in scholarly communication, and it's this. In the late 1990s and early 2000s, when scholarly publishers were putting content online, they realized immediately that one of the most useful things they could do for researchers would be to make references clickable, so that where a reference was available online you could be taken automatically to it. It seems obvious, but there were some logistical problems in actually doing it, not the least of which is that there are thousands of publishers, so you had to have some idea of what the URL structure was going to be on each publisher's site; and of course if publishers changed that URL structure or did anything weird, a lot of those links would break. And this is actually the major problem Crossref is trying to address.

And it may seem like an easy thing to do. After all, one thing you could do is tell publishers to stop breaking things, right? Redirect your old URLs when you modify your websites. And that works for a class of broken links, and actually probably for the vast majority of broken links, but it only addresses one of the ways in which links typically break: an organization is a bit lazy, they go and change the structure of their website, they still control the domain, but they don't redirect the old URLs. That's by far the most common source of broken links. But there's another source of broken links that's a lot harder to deal with, and that's when the domain part of the URL changes. That is, when an organization renames itself, or splits into two, or merges, or does something like that, you might not have control of the old URLs anymore, and so you need some mechanism that keeps links working even when these situations occur. And you might think these situations are uncommon, particularly for publishers and government organizations and the kinds of entities that participate in Crossref, but there are a lot of reasons these things can change. It's not just that things get sold; there are acquisitions, transfers, bankruptcies, simple forgetfulness, and increasingly, as you see people using URIs as personal identifiers, sometimes that means they no longer control a domain that they controlled at one point. As I say, there are a lot of reasons that links break. A lot of people think, oh well, university links aren't going to break, universities have been around for a long time; government links aren't going to break. Actually, government links have among the highest rates of breakage of almost any kind of organization, which is not surprising if you think about it, because when a new government comes in, what they want to do is obliterate the history of the old government, so they rearrange all the websites, and they don't really have much interest in preserving pointers to the old content. Countries change names; all of these things happen, and that's when the domain name might actually change.

So these are the things we're trying to address, and we're trying to address them, for one, because links are important, just as they are anywhere on the web, but more importantly because in the case of scholarly articles they represent the evidence record, the scholarly citation, right?
If this goes away, you can no longer see the evidence for the claims made in scholarly articles, and in addition to that, you might not know who is making those claims. So this is really important, particularly in the case of research, so we really had to address it. Now, the mechanism that DOIs use is not technically difficult; it's a more socially difficult problem to solve. What DOIs give you works very much like a card catalog. If you went to a physical library in the old days and looked in the card catalog, it would not tell you that a book was on the third floor, fifth shelf from the back; it would give you a call number, and the call numbers were mapped to physical locations. And that meant that if the library was rearranged or books were reshelved, they didn't have to update the card catalog; they could just change the mapping of the call numbers. DOIs work exactly the same way: that horrible opaque string that Dario mentioned earlier is in fact a pointer. It says, okay, where does this content physically live now? So if a publisher changes their website, or gets acquired, or changes domain name, or something like that, all they have to do is update those pointers, and all the old links using the DOIs will continue to work.

So this allows us to persist links, and this is an important term: persist. We don't claim that they're going to be permanent; persistent is more of a synonym for stubborn. They're stubborn links. We will do our best to keep them updated, we will contact our members, and we will do things like fine them and turn off their ability to deposit if they don't update things. But by and large they do, because of course this is a great source of traffic for them, and they take this stuff seriously, they take scholarly references seriously. So that's the way the system works. As I said, the technology isn't the important part; it's a simple redirect, and almost every URL shortener on earth uses similar technology. The thing is that those don't have an organization behind them, and they don't have a membership model that requires members to adhere to certain conventions and to behave in a certain way. So the technology is very similar, but the organization is a little bit different. And again, this organization mediates the problem of having to make bilateral agreements between what really are thousands of publishers. A lot of people, to the degree that they think about scholarly publishers at all, think of the big ones: Elsevier, Springer, Nature, and so on and so forth. And of course they are members, but if you actually look at our membership list, it's now approaching 6,000 members, and that's an awful lot of organizations to keep track of individually. So this central linking switchboard really helps.
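To make the mechanism concrete, here is a minimal sketch of DOI resolution as an HTTP redirect chain, assuming the public doi.org resolver and Python's requests library; the DOI shown is just a placeholder, not one mentioned in the talk.

```python
# Minimal illustration of DOI resolution as an HTTP redirect chain.
# Assumes the public doi.org resolver and the `requests` library;
# the DOI below is a placeholder invented for this sketch.
import requests

def resolve_doi(doi: str) -> str:
    """Follow the resolver's redirects and return the URL the DOI currently points to."""
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)
    for hop in resp.history:                        # each hop is one redirect in the chain
        print(f"{hop.status_code} -> {hop.headers.get('Location')}")
    return resp.url                                 # wherever the publisher has registered the content today

if __name__ == "__main__":
    print(resolve_doi("10.1000/example-doi"))       # placeholder DOI
```

The point is that the DOI on the left-hand side never changes; when a publisher moves, merges, or is acquired, only the registered target URL behind it is updated.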
So what does this have to do with Wikipedia? Well, years ago, when I first joined Crossref, one of the things I was interested in doing was understanding a little bit about how non-scholarly content was making use of the scholarly literature: blogs linking to the scholarly literature, social networks linking to the scholarly literature. And at the time, this was back in about 2007, one of the things I looked at was Wikipedia. I downloaded a dump of Wikipedia and analyzed it to see how many DOIs seemed to be being referenced, and also to look at the references themselves to see whether, even if they didn't have DOIs, they looked like they were scholarly. And at the time the answer was that there were a few, but not a lot. So I sort of promised myself that I'd go back to the problem later on, because it also looked like things were changing and growing.

And we did come back to the problem. With somebody from my group, we took the referral logs for the DOI system, that is, the logs that show us every time somebody clicks on a DOI and where it redirects to. And we looked at those referral logs to analyze who was driving traffic to our members' sites. We were interested broadly in understanding how traffic was being driven from non-publisher sites, because we have a pretty good idea of how much the publishers themselves drive. And when we did that, we were really, really surprised, because we learned that Wikipedia is the fifth largest referrer of DOIs to the scholarly literature in the world. Now, I usually use weasel words around this, because the actual referrer logs are a little bit hard to interpret; there's a lot of noise in them. But I'm pretty confident that this is the case now, after having looked at the logs again. And I actually think the figure is, if anything, an understatement, because the other thing that we've done, and I know Dario has done this as well, is we've looked at Wikipedia articles to determine, A, how many DOIs there are in the references and how many of those DOIs are linked, and not all DOIs that are in the references are linked. And then secondly, not all references to the scholarly literature use DOIs. That's largely because they're the sort of thing you only know about if you're in the trade; you wouldn't necessarily know, if you weren't a researcher, that you should add a DOI, and even if you were a researcher, you might not know how to look up the DOI and that you should link with it. So we think there are probably a lot more references to the scholarly literature than even the traffic suggests.

So what does this translate to? Back in 2013, when we last analyzed this data, and I tried to get a refresh of the data to present today, but it's still being processed, so I don't have the new numbers, it was about 20,000 to 30,000 referrals a day, increasing by about 2,000 referrals a day over the eight-month period that we analyzed. The top ten subdomains that we saw coming in were, naturally, the English Wikipedia, but then also a lot of traffic from some of the other language Wikipedias. So this really made it clear that we wanted to engage more closely with Wikipedia, to A, understand the traffic that was coming from there, and B, see if we could actually help the editors of Wikipedia articles to link persistently to the scholarly literature.
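As a rough sketch of the kind of referrer-log analysis just described, the following counts resolutions per referring domain and lumps all Wikipedia language editions into one bucket. The Apache combined log format and field positions are assumptions made for illustration; Crossref's actual resolver logs are not public and certainly differ in detail.

```python
# Count DOI resolutions per referring domain from an access log and flag
# Wikipedia traffic. Log format (Apache combined) is an assumption.
import re
from collections import Counter
from urllib.parse import urlparse

# Matches: "<request line>" <status> <bytes> "<referrer>"
REFERRER_RE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"')

def referring_domains(log_lines):
    counts = Counter()
    for line in log_lines:
        m = REFERRER_RE.search(line)
        if not m or m.group("referrer") in ("-", ""):
            continue
        host = urlparse(m.group("referrer")).netloc.lower()
        # Collapse all language editions (en.wikipedia.org, de.wikipedia.org, ...) into one bucket.
        if host.endswith("wikipedia.org"):
            host = "wikipedia.org (all languages)"
        counts[host] += 1
    return counts

# counts = referring_domains(open("resolver-access.log"))  # hypothetical log file
# for host, n in counts.most_common(10):
#     print(f"{n:8d}  {host}")
```

The `most_common(10)` call is the kind of top-ten referrer list discussed above; the Wikipedia bucket is where a finding like "fifth largest referrer" would surface.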
And so I had run across Daniel Mietchen, whom some of you might know, a number of times at a number of conferences, and we had been talking about this for a long time. He suggested that we put together a little group of, I think he termed them, Wikimedia ambassadors, because they weren't quite Wikimedians in residence; they were people who were interested in the scholarly literature. Back at Wikimania 2014 in London we first met and talked about some of the things we might be able to do. The initial group was Daniel, Matt Senate, Maximilian Klein, and Dorothy Howard. A few of them got distracted by other things, and ultimately last year we got together with Maximilian Klein and Anthony Di Franco to work on a project specifically to take advantage of something that had just been launched on Wikipedia, which is a live stream of edits, and to see what we could actually do with that stream. So we worked with them, and they prototyped a system called Cocytus that looked at this stream and tried to pull out DOIs. And a few months later, sometime in March, we announced that we had a live stream of Wikipedia edits, showing, in near real time, whenever anybody cited, or, critically, un-cited, a DOI in any of the Wikipedias. So if you go right now, and I hope it's live, to wikipedia.labs.crossref.org, you'll see basically this live update of Wikipedia edits, and particularly articles that are citing the scholarly literature, and as I said, sometimes un-citing as well.

And this has been a really useful exercise for us, because it's really got our members, the scholarly publishers, interested in traffic that's being driven to them from outside the scholarly literature. It's really highlighted this as a source of a lot of interest. And so they've really been encouraging us to, A, further explore what we can do to encourage people to link using persistent identifiers, and B, monitor this and help build tools that help people use not just DOIs but other kinds of persistent identifiers, like PMIDs and the like. And most recently, just this week, Joe Wass, who works for me in Oxford in the UK, decided he was going to take advantage of the fact that the Raspberry Pi Zero had just come out, and he managed to snag one of the last ones before they sold out. At the beginning of the week he posted a little experiment he conducted where he put together a framed Raspberry Pi Zero showing, in real time, how many DOI citation or un-citation events are occurring. So now we proudly have this on our wall in our office at Oxford, next to the refrigerator. If that doesn't show you how important we think this stuff is, nothing will. And I'll just note that the thing hanging below it is a collection of telephone adapters that I put together in the old days, when I used to travel a lot and had to carry all of those things around with me all the time. If there's a better illustration of the benefits of standards, I don't know what it is, but that's also hanging up in our office.

So ultimately, I think this gets back to an overall goal that we have at Crossref, which is to understand how non-scholarly sources are engaging with the scholarly literature. We think this is important for us to monitor.
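The core of a citation-event monitor like the one just described can be sketched as a diff over the DOIs found in the old and new text of an edited article. This is only an illustration: the DOI pattern is a common heuristic rather than Crossref's exact matching rules, and the real tool also has to listen to the live edit stream and fetch both revisions, which is omitted here.

```python
# Sketch: given the old and new wikitext of an edited article, report which
# DOIs were added (cited) or removed (un-cited). Heuristic DOI pattern only.
import re

DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|"<>\]}]+')

def doi_events(old_text: str, new_text: str):
    old_dois = set(DOI_RE.findall(old_text))
    new_dois = set(DOI_RE.findall(new_text))
    return {
        "cited": sorted(new_dois - old_dois),      # DOIs that appear only in the new revision
        "uncited": sorted(old_dois - new_dois),    # DOIs removed by the edit
    }

# Toy revisions of an article (the DOIs are placeholders):
before = 'Goats live 15 to 18 years.<ref>doi:10.1000/goat.lifespan</ref>'
after = ('Goats live 15 to 18 years.<ref>doi:10.1000/goat.lifespan</ref>'
         '<ref>doi:10.1000/goat.diet</ref>')
print(doi_events(before, after))   # {'cited': ['10.1000/goat.diet'], 'uncited': []}
```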
Increasingly, it's important for researchers to know this, for funders to know this, for publishers to know this: to know that formally published research is having some sort of effect, that it is being accessed by civilians out there, not just scholars. And so ultimately our goal is to feed this into a project that we'll be launching in the middle of next year called the DOI Event Tracker, which is a general-purpose framework where we want to gather information about the usage of DOIs in all sorts of different sources, whether it's Wikipedia or Twitter or blog posts or social networking sites, so that we can build a pool of data that can be used for, amongst other things, building awareness applications, perhaps building metrics, and so on and so forth. And again, we see this as something we're uniquely positioned to do, because we can basically create one mechanism for collecting this information on behalf of our 5,000 members from however many different social media platforms or other platforms might be engaging with the literature. So we're building this tool that goes out, collects data, stores it, and makes it open, and this is really critical: as we collect this information, we want to make sure that people can make maximum use of it. We don't want to compete with the organizations that are doing professional value-added services, the general category of organizations building altmetrics, doing reports, analyses and things like that. But what we want to do is make sure that these things don't turn into the equivalent of the Thomson Reuters of our day. Thomson Reuters is a big organization, well, there are two big organizations, that basically are the only ones with an overall view of traditional citation patterns in the literature, and I'm sure Dario is going to talk about that in his talk. We want to make sure that if we're collecting new sources of information about usage, that data is open, comparable, auditable and portable. That is, we want to make sure this raw data belongs to the community and that everybody can make use of it.

So the next things we're going to be working on, and part of my goal in visiting Dario is to see whether we can collaborate on some of them: we're interested in further analysis of the referrals by subject category. We have a strong suspicion that the DOI referrals we're seeing, that is, people clicking on DOIs and following them to the literature, occur in specific subject categories, possibly the biomedical literature and the like. We also want to start searching for directly linked references, those cases where somebody linked to something on a publisher site but didn't use the DOI and instead put the publisher URL to the article in there, and see if we can map that back to the DOI and collect that kind of information; and to search for unlinked scholarly references, and not just articles but monographs, and what we call the orthogonal related literature, things like patents and standards and other things that are closely tied to the scholarly literature and often interact with it. And we're very interested in working with people to improve the citation tools that are being built for Wikimedia and for the Wikipedia platform.
So for instance, there's a lot of interest in collecting some of this information and making sure that references are pulled out into Wikidata, and we're happy to feed Wikidata so that it has those references, so that not every reference is an individual string. We think that by doing this you'd probably get a lot of benefits: for instance, by using some of the Crossref metadata you'll be able to do things like flag references to open access literature, or flag references to literature that has perhaps been updated, that is, corrected or, in extremis, retracted or withdrawn. So we think there are all sorts of tools that can be built off of this that will make the reference and citation process in Wikipedia more useful and more dynamic. And I strongly suspect Dario will talk about some of those projects.

So basically, that really is what we've been doing with Wikipedia. It's part of an overall project of ours to look at how things interact with the scholarly literature. Wikipedia is by far the largest source of referrals and contact with the scholarly literature that we've seen from outside the scholarly literature itself; as I say, it ranks right under the top aggregation platforms like Scopus and Web of Science. If you're interested in tracking the stuff that we do further, look at blog.crossref.org or labs.crossref.org, where we post all of our experiments and talk about them, and of course you can contact us directly. Joe Wass is the person who's been working most directly on a lot of this kind of stuff; he's the one who built that tool. A lot of it also depends on other work by my team, like our API, which Dario and others are planning to use to mine some of these references. And of course you can contact me if you're interested in any of this; I'm sure we'll be working pretty closely with Dario. So anyway, that's it. I think I actually hit my time.

Fantastic. Thanks, Geoff. This is the best possible introduction to my talk. Brenda, can you help me switch to the Google Docs? While we're doing that, I want to say that Geoff covered one issue that is really critical, the disappearance of links, link rot. It's a huge, huge issue if you want this network of references to remain persistent and available. I'm going to cover another aspect of the issues that we have with links and sourcing, and that will be the main focus of my talk today. Okay, this is working. Yeah. All right.

So briefly, what am I going to do? I'm a researcher at the Wikimedia Foundation, but today I'm not going to talk about what my team is doing; I'm talking about something I'm passionate about and something I've been collaborating on for a while. I'm really interested in the question of how we use open knowledge and collaboratively created reference materials, like those you find on Wikipedia, as the entry point towards the scholarly literature. Cameron Neylon is the one who came up with the brilliant wording of Wikipedia as the front matter of all research. He also did a great workshop at Wikimania in 2014. A few years before that, the Chronicle of Higher Education had called Wikipedia, a layer without formal vetting, the ideal bridge between the validated and the unvalidated web. And I find this a pretty strong metaphor for what Wikipedia represents. Typically, when people start talking about access to scientific knowledge, the audience expects a rant about paywalls, about how broken the publishing model is.
And today, instead, I want to talk about technology. So you'll hear me talk about unique identifiers, you'll hear me talk about structured data, you'll hear me talk about bibliographic metadata and Wikidata, and you'll hear me talk about goats. And the reason I want to talk about goats today is that I want to give you an example of what I like to call the disappearance of provenance. This is the main topic I want to talk about, along with the ways we have to counter this general tendency that we see on the web today. So my goal today is threefold. I want to try and persuade you that the algorithms used by search engines are undermining linking and sourcing, which are two of the fundamental tenets of the open web and, ultimately, of Wikipedia, which builds itself on the notion of sourcing and verifiability. I want to try and persuade you that Wikidata is a solution to this problem. And finally, I want to introduce you to a little-known, community-led project called WikiProject Source MetaData on Wikidata, which is actually trying to address this issue.

So, back to goats. As I'm sure you all know, Wikipedia has an amazingly well-sourced article about goats: their anatomy, their diet, their evolutionary history, their presence in fiction and pop culture, with dozens of references, in up to 135 languages. We really have plenty of very well-sourced information about goats on Wikipedia. And somewhere in the middle of the English Wikipedia article, you will find a very inconspicuous sentence about the average lifespan of a goat, with footnotes and references that you can look up to verify that statement. Now, this is what a popular search engine gives you when you search for the average goat lifespan: 15 to 18 years, period. Let's take a look at this. This information seems to have been extracted from Wikipedia, or at least from the source that is cited by Wikipedia, but it's presented with no provenance information whatsoever. And you can try this exercise with plenty of statements you can think of that lend themselves to a quick answer of this kind. This is not an accident; it's something that search engines are heavily investing in. I don't know if many of you have heard of the Google Knowledge Vault, quick show of hands. Yes, a bunch of you. Okay. So the Knowledge Vault is one of the most ambitious projects Google is currently working on, in terms of going beyond Freebase as its semantic engine. The idea is that Google crunches the vast catalog of documents in its index and generates what they call confident facts. Confident facts are basically triples, statements like the ones we just saw. They're extracted from all these sources, and they are assigned a level of confidence as a function of how many sources back the statement. So in other words, references are just used as a signal to determine the confidence that Google has in the truth of these statements; they're not necessarily represented in the output of the algorithm, right? They're instrumental to generating these statements. And I think this is a trend that we need to be aware of, and a direction that raises some interesting questions. Obviously, Google is not doing this for fun. This is basically the dream of the linked data vision, where you can generate quick answers just by looking up a pool of connected facts. And with most people shifting to mobile, this need for quick answers is becoming even more pressing.
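As a toy illustration of the "confident facts" idea, and emphatically not Google's actual model, the following sketch represents a fact as a subject-predicate-object triple whose sources feed a confidence score but are dropped from the answer the user sees.

```python
# Toy model of a "confident fact": sources are only a confidence signal and
# never appear in the answer shown to the user. Purely illustrative.
from dataclasses import dataclass, field

@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    sources: set = field(default_factory=set)   # kept internally, never shown to the user

    @property
    def confidence(self) -> float:
        # Naive stand-in for a real scoring model: more backing sources, more confidence.
        return min(1.0, len(self.sources) / 10)

    def answer(self) -> str:
        # What the answer box shows: the bare statement, no provenance.
        return f"{self.subject} {self.predicate}: {self.value}"

goat_lifespan = Fact("goat", "average lifespan", "15-18 years",
                     sources={"en.wikipedia.org/wiki/Goat", "example-vet-handbook.org"})
print(goat_lifespan.answer(), f"(confidence {goat_lifespan.confidence:.1f})")
```

Only `answer()` reaches the reader; the sources that justified the confidence never do, which is exactly the loss of provenance being described.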
So this is an example of what you see when you ask Siri, how long do greyhounds live? You get an answer, in this case from Wolfram Alpha, but it's pretty much the same story, right? The answer is about 11 years, no source whatsoever. And interestingly, if you go and check what Wolfram Alpha is doing, they have a big disclaimer where they tell you that you shouldn't really expect to pinpoint any single source for that statement; they really warn you that that is not what you should be doing. And the reason is the same: they have a bunch of sources they're using, and they want you to trust the answer engine as the authority for that statement. Provenance-free information, that's what users are looking for: they only care about the actual statement, not the source backing it. David Weinberger made a fantastic and compelling point back in 2012, when he called out the despicable behavior of these websites that link to themselves as opposed to linking to external sources, and he called these sites a stopping point in the ecology of information. He basically argued that by taking the links out of content, these websites are turning themselves into the authority for this information, as opposed to allowing people to verify these knowledge claims in their original context. And in a recent conversation with him, he reminded me that there's a word for this, a beautiful precedent in history for this notion of provenance-free information, and that's the almanac. The almanac is basically the best example of an answer engine that doesn't give you any context about the source. And humanity in 2015 is recreating the almanac thanks to algorithms, which is an amazing achievement for our civilization.

Now, we do have a project that is an answer engine that doesn't quite work like an almanac, one that acts as a provenance-preserving answer engine, and that's Wikipedia. Wikipedia is all about verifiability. It's not about truth; it's about providing access to reliable sources. Its reputation comes from the fact that you can look up and verify this information by yourself. That's what makes Wikipedia such an outstanding public service. That's why Wikipedia is not an authority in itself; it's an entry point towards authoritative sources. And you might think that we're doing a good job, given that this is basically our official mission, but things are far from perfect. What you see here is the breakdown of statements we have on Wikidata as a function of the sources they have, and as of last month up to 80% of these statements either use Wikipedia as their main source or have no source whatsoever. Only roughly 20% of the statements in Wikidata have an external source that is transparently represented as backing that statement. This is a giant caveat on the quality of the data that Wikidata is generating.

So how do we start tackling this problem? I think we have an outstanding opportunity at Wikimedia to address this issue by building a human-annotated, collaborative repository of all citation and source data, and to try to bridge the gap between the information we present on our projects and the sources that back it. In a perfect world, where all information can be represented as a connected pool of statements, metadata about the sources, and the sources themselves, should be part of this database, right? There's no reason why this should not be part of the database.
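For a single item, this kind of sourcing breakdown can be sketched against the public Wikidata wbgetclaims API, treating references that contain only P143 ("imported from Wikimedia project") as Wikipedia-only sourcing. This is just an illustration: the aggregate 80% figure quoted above comes from analyzing a full dump, not from per-item API calls like this, and the bucketing rule here is a simplification.

```python
# Sketch: classify one Wikidata item's statements as unsourced, sourced only by
# a Wikipedia import (P143), or backed by some other reference.
import requests

API = "https://www.wikidata.org/w/api.php"

def sourcing_breakdown(qid: str) -> dict:
    claims = requests.get(API, params={
        "action": "wbgetclaims", "entity": qid, "format": "json",
    }, timeout=10).json().get("claims", {})

    counts = {"unsourced": 0, "wikipedia_only": 0, "external": 0}
    for statements in claims.values():
        for st in statements:
            refs = st.get("references", [])
            if not refs:
                counts["unsourced"] += 1
            elif all(set(r.get("snaks", {})) <= {"P143"} for r in refs):
                counts["wikipedia_only"] += 1
            else:
                counts["external"] += 1
    return counts

# print(sourcing_breakdown("Q2934"))  # Q2934 is believed to be the item for "goat"
```

Aggregated over many items, these same three buckets correspond to the kind of breakdown shown on the slide.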
And there have been countless efforts to try and build a comprehensive repository of open bibliographic data. Crossref probably represents the most accomplished effort in that direction, but Crossref only covers some of this data, and makes only some of it publicly available to the general public. And Crossref data, for example, is not integrated into a knowledge base: it's very hard to go from a bibliographic reference, and maybe the author information, to the topic of that reference, or to the institutions, to some extent. Basically, Crossref data is a fantastic but insular source of information about sources. How do we connect it to the knowledge bases that we're so good at building here at Wikimedia?

Within Wikimedia itself, there have been at least ten years of attempts that I'm aware of at building a repository of bibliographic information for all the citations that we have in Wikipedia. These efforts go back to 2005; none succeeded, and they largely predate Wikidata. But finally, we're here. Finally we have Wikidata, and I think we have a project that has the vision, the technology, a community that geeks out about metadata and structured data, the ability to operate at scale, the right licensing model, because we're talking about mostly CC0, non-copyrightable information, and, finally and importantly, the independence. Because we want to have a service that is not a spin-off of a publisher, but a central repository that is not subject to the commercial interests of third parties. So basically, we have all the conditions today: Wikidata has all we need to start building this human-curated repository of all citations, starting from the citations used in Wikipedia. And this sum of all human citations could then feed into a large ecosystem of consumer services: internal integration with Wikipedia and Commons; the altmetrics services Geoff mentioned before that measure the impact of citations; and also, of course, all of the providers of scholarly metadata and identifiers.

So how do we start building this? Well, it turns out there's a group of people in the librarian, scholarly, and Wikimedia communities, including myself, who got together and started figuring out how to build it. So meet WikiProject Source MetaData, which you can look up on Wikidata. The first goal of this group was to determine a data model for storing source metadata in Wikidata as items. The idea is: what are the properties that we need to represent the core set of information we want to store for any given citation, like journal article properties, journal properties, book properties, author-related properties, and so on. And I think we did a pretty good job. We now have a solid schema that has been discussed and vetted, and we're starting to have, in Wikidata itself, plenty of publications already stored using this schema. This is an example of a journal article published in PLOS ONE, one of the journals of the Public Library of Science. This is the title, and you can see the license, the language in which it's written, the authors, the publication date, the main subject, and even a picture of what the paper is about. So all of this is possible today, and we already have quite a few publications in Wikidata using this schema.
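Roughly, an item following this schema can be pictured as a set of property-value pairs. In the minimal sketch below, the property IDs (P1476 title, P50 author, P356 DOI, P1433 published in, P577 publication date, P921 main subject, P275 license) are the Wikidata properties conventionally used for these notions, but every item ID and value is a placeholder invented for illustration.

```python
# Minimal sketch of a bibliographic item following a Source MetaData-style
# schema, written as a plain dict keyed by Wikidata property IDs.
# All QIDs and values below are placeholders, not real Wikidata records.
journal_article = {
    "qid": "Q00000000",                      # hypothetical item ID
    "P1476": "An example open-access article about goats",   # title
    "P50": ["Q-author-1", "Q-author-2"],     # authors as links to other items
    "P356": "10.1000/example-doi",           # DOI (placeholder)
    "P1433": "Q-plos-one",                   # published in: the journal's item
    "P577": "2015-01-01",                    # publication date
    "P921": ["Q-goat"],                      # main subject
    "P275": "Q-cc-by-4.0",                   # license
}

def cite(item: dict) -> str:
    """Render a human-readable reference string from the structured item."""
    return f'{item["P1476"]} ({item["P577"]}). doi:{item["P356"]}'

print(cite(journal_article))
```

A template on a wiki page could then render a reference string like this from the structured record, instead of storing the reference as free text.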
But what are the actual benefits of this approach? At the end of the day, you could say, you know, we have all these references in Wikipedia already, and they're structured to some extent. This is what a template looks like in English Wikipedia, the cite journal template. Now, the trouble with this approach is that right now all of these citations are buried in the body of an article. There's no way you can interact with them as items in a database; you need to parse content, and it's a huge pain to try to make any sense of it. And so people, if they want to cite something, still need to add a reference to the actual article text. What if we could instead cite by reference, as you can do in most reference managers today? What if we could pull the reference from a database like Wikidata, cite it by referring to the item that exists in Wikidata, and generate the appropriate reference string? This is possible today on many wikis by using templates that allow you to pull information from Wikidata. This is going to be one of the first benefits of using Wikidata as a source for this data.

This next one is going to be a bit of a geeky, technical slide, but it's one I feel strongly about. Wikidata can also be the place for storing all the mappings between identifiers. If you've never heard of identifier mapping or authority control, you can safely skip this slide, but if you have, and you've played with something like Magnus Manske's Mix'n'match tool, for example, the tool that allows you to cross-reference Wikidata items with external catalogs, you realize how important it is to have a place where you can represent the fact that a given scholarly article, identified on Wikidata via its own QID, is the same as an article that has a DOI, a PubMed ID, et cetera, et cetera, and the same for authors: authority control via ORCID, via VIAF, via Google Scholar IDs, and you name it. So Wikidata can become the place that holds all the mappings describing the same objects.

There's also this growing interest, which Geoff mentioned, in measuring the impact of citations beyond scholarly citations. This is an example from a popular service called Altmetric.com, and what they're doing is similar to what the DOI Event Tracker is doing: they're monitoring the usage of scholarly citations in Wikipedia, mostly to give credit to authors and to funding bodies, to show, hey, your work is actually being cited, not just in journals, but also in the most popular encyclopedia online, which people are reading on a daily basis.

To me, possibly the most interesting implication of this project is the ability to cross-reference sources, as I said, with the vast body of knowledge that we have in Wikidata. So again, if you're using the Crossref API, you'll be able to know something about a specific DOI and the metadata associated with it. By caching this data in Wikidata and cross-referencing it with all the other structured data we have in Wikidata, you can start going from a paper to its main subject, from the subject, in this case, to the taxon that is referenced in that paper, from the paper to the author, from the author to the affiliation, from the affiliation to the location, and you name it. So this is basically the rabbit hole applied to structured data.

The next two slides are a bit of a moonshot; this is not within the scope of what we expect to happen anytime soon. But once you have source metadata in Wikidata, you can also start thinking about annotating the sources themselves. Geoff mentioned licenses: knowing whether a specific paper is openly licensed or not is something that has gotten quite a lot of attention recently.
That's something the Crossref API can provide, but you could imagine adding other types of properties to these papers, for example retraction. You can start representing links between different sources, for example A citing B. You can even go as far as representing the semantic type of these relations: there's a proposal for what is called a citation ontology that has been discussed for a couple of years, and given that Wikidata allows you to design arbitrary relations, you could imagine properties that let you describe that source A extends source B, or that source A uses methods or data already used in B, or even express disagreement and conflicts between sources. This is the kind of thing that, as of today, is really, really hard to extract just by looking at the citation graph of the literature. And finally, once we have all of this data stored in Wikidata, easily queryable and analyzable, you can start answering questions which unfortunately today are virtually impossible to answer at the push of a button. Give me all the publications in pharmacology from the 90s that have been retracted. Give me all the facts we have on Wikidata that are backed by works of physicists who graduated from given universities in the 80s. Give me all statements that are supported by articles published in the New York Times, et cetera, et cetera. So all this work, which is really fundamental for the validation of this knowledge as a function of its sources, will suddenly become possible if we have all this data queryable in Wikidata.

So I'm going to close here and say that, in sum, we have the ability to start building an answer engine that is not an almanac, an answer engine that preserves provenance, and we can do this using existing technology, basically Wikipedia and the properties and data models that we already have on Wikidata. As for next steps, we're talking to a bunch of people to run a pilot to try and populate Wikidata, or potentially a sandbox version of it. James here set up an instance similar to Wikidata called Librarybase that we can use for testing and sandbox purposes, to see if we can get a community to import and curate this data. We're going to start designing strategies for automatically importing some of this data from, for example, the Crossref API, and linking it to the corresponding statements. We need to understand and refine the data model, in case some of the properties are not the correct ones. There's a big, big issue of entity disambiguation: once you start having all these authors called John Smith with multiple IDs, the question becomes how you disambiguate them. And finally, most importantly, how can you design a system that allows you to ingest this data while preserving the human curation layer? That's probably the biggest challenge, right? It's easy to import all this data automatically; I think the 70 million DOIs you guys have would be relatively easy to ingest, it's not an impossible task for Wikidata. The question is, how do we do this in a way that preserves the human curation layer? And with that, I conclude. Thanks for your attention, and I'd be happy to take any questions. Geoff, you can join me here, and we have someone relaying questions from IRC.

It is. Look it up on Google. Hey, Dario. Yes. This is Jake. I might have missed it, is this question time? It is. Hi, Jake. Great presentation, both of you. Awesome stuff.
I'm curious: obviously the vision here is to capture all citations ever, but our shorter-term or more proximate issue is just capturing all the citations that are on Wikipedia. I was wondering if you could talk about the relationship between those two stages?

Yeah, that's a good question, and the short answer is, I don't know how this is going to play out. What I know is that we want to start small. Like I said, we cannot just start by ingesting all this data; again, 70 million statements is not an impossible size compared to what we could store in Wikidata, but that probably wouldn't be desirable. So I think the first step, after we sandbox and test the data model, is, like I said, to focus on one community. We have people in the chemistry and biology and genetics communities very interested in exploring this idea of a pilot, and we'd basically start from there. And yes, I guess the first ambitious goal is to have all citations from Wikimedia projects stored in Wikidata. And then the next step is going to be to put Scopus and Web of Science out of business; that's going to take a bit longer.

If there's not anyone else asking questions, I might. This is Alex Stinson, with the Wikipedia Library team. One of the things that brought us into these conversations is looking at how libraries make recommendations to researchers doing work in their libraries and using their databases. And what we have with Wikipedia is hand-curated citations that are grouped by topic areas, based on how categories and WikiProjects work on Wikipedia. So we can kind of tell, like, if journal X shows up in a certain percentage of these 10,000 articles in this topic, this might be a journal to start your research on X type of novel, right? Or X type of biography, or whatever history. So the human-curated element of this is really fascinating to me, because we could take the human-curated citations from Wikipedia and build really robust research recommendations into library infrastructure. So I'd like to hear you talk a little bit more about that relationship between making recommendations for researchers and the human element that's so powerful in Wikipedia.

Geoff, any thoughts on this? Sure, there was something I was going to say, which is that one of the reasons I'm interested in getting a better picture of where there are lots of scholarly references and where there aren't is that I have a suspicion that, if we're seeing lots of references being followed from certain articles, a lot of Wikipedia articles may be serving a purpose that has traditionally been performed by what are called review articles. And I've got the feeling that Wikipedia might actually be starting to become an entry point for people who are doing research in a field. And it sounds to me like that's related to what you're talking about: if you could, in fact, flag those articles and say, hey, these seem to be really good entry points for certain topics and so on, I think that could be quite useful as far as recommendations go.

Yeah, and the other thought that I had was related to the maintenance and quality of this data. I know in some cases we have author names that are misrepresented.
And I guess one of the assets we have in Wikidata, or in Wikipedia in general, is that when we cache this information, we can also put it under revision control. So if a human notices an incorrect spelling of an author name, or, in some cases, titles that are not correctly represented, or a mismatch between the official record on the publisher's site and Crossref, I guess one way of thinking of the human curation layer is that it also addresses that issue of potential errors in the data.

Right. And I mean, Crossref has been interested in this area a lot too, because one of the things that we find with things like ORCID and Crossref, and things like arXiv and even PMC, is that none of them are set up so that other people can easily make assertions or corrections about the metadata. In our view, we would like to encourage that, and this is part of what we're looking at with the DOI event tracking system: to allow people to put information in there about DOIs that maybe hasn't been put in by the publisher. So for instance, if a funder knows that a certain DOI identifies an article about something that they had funded, they might want to be able to make that assertion. Or the researcher might want to be able to make that assertion. As far as versions go, we actually think that collecting all of those statements together might be useful, even if they contradict each other. Of course, if they don't contradict each other, that gives you a bit more confidence that the statement might actually be correct; but if they do contradict each other, that's also useful information to know. So we're interested in storing these assertions, or claims as we call them, so that people can harvest them and see claims from different parties about the same identifiers.

That is actually a great point. The idea of having some of this data represented in Wikidata means that we also have talk pages. So you now have an entry point that is collaboratively editable, where you can have a discussion about a specific source; it's centralized, and it's openly licensed. So it's a better way of handling exactly what you just referred to.

I have a question from IRC. One is: 70 million items, plus probably an average of three authors each, so roughly 200 million items as an estimate, is that right? Yeah. So, when we talked about an estimate of how many items and statements would have to be created on Wikidata, if you start thinking of the item itself, the authors, the affiliations, the journal, we're talking about something around that order of magnitude. So again, it's not unthinkable given the current size of Wikidata; it would be far larger than the current size of Wikidata, but it's not inconceivable at this stage.

Right. I mean, the 70-odd million, I think it's 77 million: Crossref has 77 million DOIs assigned to research objects, the vast majority of which are articles, and we have bibliographic and non-bibliographic information for those. The thing you have to consider in your case is, of course, the number of relationships between those 77 million items, and that's going to be many hundreds of millions of relationships.
And that's not to mention the relationships between those things that have formal references and the things that don't, because there's an awful lot of data and other material out there that hasn't traditionally been included in scholarly references, which, of course, I think both of us are interested in seeing referred to more robustly.

But scalability, thankfully, is not a giant issue at Wikimedia; so that's a problem for once we get there. So, I have a question which, in fact, deviates a little bit from the referral conversation around Wikipedia. Traditionally, the DOI identifies, or gives numbers to, actual scholarly publications. But as publishing moves from the traditional format to more online publication, including Wikipedia content, we still have a lot of news articles that we use as sources in Wikipedia articles. What's the future plan for the DOI to start including more of those online publications, to stabilize them and eliminate the role of raw URLs?

Right. So the DOI, and this is a point that I keep making, traditionally the Crossref DOI has been used for identifying scholarly articles, and particularly for citing scholarly articles. And that's an important thing to note, because as an identifier it works at a different level than, for instance, an identifier would if it were being used as a serial number or as a supply chain management identifier. Let me give you an example. The EPUB, PDF and HTML versions of an article are intellectually equivalent, right? You don't want different identifiers for them for the purposes of citation, but you might want different identifiers for them if you were, for instance, trying to measure how many units you were selling in each format, or something like that. So in the case of DOIs in particular, we're using them to refer to intellectually distinct objects. And that poses some interesting problems, because a lot of people say, well, there's a whole new category of research objects that are different because they're more dynamic, because they're changing all the time, Wikipedia being a classic example of content that is effectively in flux, except that you still kind of need the ability to refer to a particular version of it. The expectation you have of a citation is slightly different from the expectation most people have of a link. People are perfectly content to have a link to something and expect that thing perhaps to change over time; but if you have a citation, you want to see what the authors saw when they cited it, maybe 50 or 100 years ago, right? So you have some different behaviors there. So I hope I'm getting to your question, which is that we don't see a fundamental problem with using DOIs with other kinds of research objects, things like datasets, media, highly versioned and dynamic content. And this is a place we're evolving the DOI towards right now; we're working very closely with our sister organization, DataCite, which in particular is adapting the system to work better with dynamic datasets and so on and so forth.
But the thing we're always keeping in mind is that, of course, the goal both of DataCite and of Crossref is to preserve this notion that we're citing something, and when somebody has cited something, whether it's a dataset or media or an article or some new kind of communication, the expectation is that they will see what the author saw, or as close to what the author saw as possible. That's a really important thing to keep in mind. Does that answer it?

I have a question for you: how about using Wikidata queries to identify uncited or poorly cited statements, statements that aren't well cited currently in Wikipedia itself? So the question, if I understand it, is how easy it is today to identify statements that are poorly referenced or sourced in Wikipedia. Yeah, the answer is that it's very hard, because Wikidata stores a tiny fraction of the statements we have in Wikipedia. Wikidata stores, on the one hand, statements that just don't exist anywhere in Wikipedia, but it also stores, I guess, a subset of the statements we have in Wikipedia, and it's very hard to find, via Wikidata, statements in Wikipedia that would need sources. But we do have other datasets, including Wikipedia itself, that can be used to identify potential gaps. Looking at Wikipedia itself, Wikidata, and external knowledge bases, we can try to more easily identify gaps, and maybe direct the attention of this effort to where it's needed in terms of sourcing.

I have another question: can you confirm Wikipedia being used as an entry point, at least among teachers and university folks? Ah, I see, this person can confirm it themselves: they quickly glance at the wiki, look for papers, and go on to dig up more related papers. Maybe that wasn't a question, sorry; it was a statement, I suppose. I mean, one of the things I'm interested in doing, and that we really haven't done, would be to see if there are particular articles that send an awful lot of traffic, or a disproportionate amount of traffic. We haven't done that yet. Looking at fields, we have a very rough approximation of how to categorize the journals that DOIs are in, whether they're biomedical, humanities, social sciences, things like that. And then beyond that, knowing specifically whether there are some sort of super-articles that are sending a lot of referrals, that would be interesting.

And that also reminds me that we have a project in the pipeline to segment readers as a function of how they consume Wikipedia. I'm thinking that some of the data we're going to get from that segmentation effort, looking at whether people are looking for a quick lookup of a fact, for an overview of a topic, or for in-depth information on a topic, might also help us identify articles that belong to different types in terms of who is reading them.

And this is probably a good time to bring up that we're very aware, and Dario in particular has been working with us on this, very aware of the creepy factor of potentially invading people's privacy here. This is actually a big concern that's developing in the scholarly communication space, period: libraries have been really, really careful about preserving the privacy of their patrons in meatspace, and they haven't really done the same thing in the digital space, but they're beginning to realize they haven't done that.
And they're beginning to get, you know, justifiably worried about it. So one of the things that we hit when we were working on this project with Wikipedia was, of course, that when you switched to HTTPS, all of a sudden all of your referral traffic went dark, right? And so we worked very closely with you to figure out, first of all, how to make sure that our systems supported HTTPS, and then also how we could still get some information, without personally identifying information and so on, so that we could actually continue to see how much traffic was being driven from Wikipedia. So this is a big, big issue. We sit there and go, oh, we want to know all these things about the users and so on, and that actually makes people very nervous as well.

I just wanted to, as the voice of the IRC, say: thank you very much for the presentation. All right. Thanks, everybody. I think this is a wrap. See you on the Internet.