Okay, so it might be time to get started. Good afternoon, and thanks very much for attending my session. My name is Martin Klein, I'm with the research library at the Los Alamos National Laboratory, and I'll be talking about reference rot in scholarly communication today, and I'll offer a solution to the problem.

This is not a tennis match — it's a team effort — so I'd like to acknowledge my colleagues and co-workers: first of course Herbert Van de Sompel, who unfortunately cannot be here today because he's on a sabbatical in Europe, and Shawn Jones and Harihar Shankar, all three from Los Alamos as well. I also have collaborators at the University of Edinburgh, at the Language Technology Group — in particular Claire Grover and Richard Tobin — whom I'd like to acknowledge. And last but not least Andy Jackson, who looked into the decay of web resources at the web archive of the British Library; he and his graphs in particular inspired a lot of this work.

What are we covering today? Basically three main areas. I'd like to motivate the topic and give you a brief introduction to what we're talking about here. I'd like to spend a little bit of time on our latest published research, the precise quantification of reference rot in scholarly communication. And last but not least, as mentioned before, I'll spend a little bit of time proposing a solution for how we can address the problem.

All right, so let's get started with a few introductory comments. When we talk about reference rot, we're really talking about two things. The first is link rot, and I'm sure everyone has seen something like this: the infamous 404 Page Not Found error. You visit a page today, you bookmark its URI, you try to revisit it a couple of months later, and then you get this. This one is particularly ironic because it's the canonical website of the General Assembly of the International Internet Preservation Consortium — but that's beside the point. So link rot is the first aspect of reference rot; the second aspect is content drift. What is that? The canonical website of the digital library conference in the year 2000 is dl00.org, and this is what the page looked like in the year 2000. This is what the page looked like in 2004, and just by eyeballing it you get the idea: this has nothing to do with the digital library conference of the year 2000. This is what the page looked like in 2005, and I don't even know what that is about anymore, but the point is it's not about digital libraries, it's not about the conference, right? Since about 2008 the page looks like this and promises to do some fancy project management for you — believe it or not. So the point is, as you all know, resources on the web are dynamic: not only do they come and go, but their content changes. Like this one — a possible theory is that someone did not re-register the domain dl00.org, someone else took it over, and they now publish content that has nothing to do with the original intention of that canonical conference website anymore.

So, briefly summarizing the first few points: the definition of reference rot is the combination of link rot and content drift. That's the definition. The observation is that these resources are subject to reference rot, right?
They change and they go away. The problem — or one of the problems — arises when we write scholarly and research articles and reference those resources using their URIs, because at the end of the day, with the notion of reference rot in mind, that really threatens the integrity of our scholarly record, right? Scholarly articles are to a large extent based on references that we trust can be revisited at a later point in time and consumed by the reader. The problem, though, is that unlike the other scholarly articles we also reference in our articles, the custodianship of these so-called web-at-large resources — web pages, scholarly wikis, your project website — is a completely different one. We're talking about web admins who just happen to have a page online and who may not even know that you, as the author of a scholarly article, reference them.

Another example, and you may be aware of this one, is this page. Consider the case where the Supreme Court writes, in an opinion, a reference to a resource on the web, just like Justice Alito did in this case, and the custodian — the web admin of that web resource — recognizes that and tries to raise some awareness of the case. He changed the content of the page to say, well, aren't you glad you didn't cite to this web page in a Supreme Court opinion, and so on and so forth: if you had, like Justice Alito did, the original content would have long since disappeared, and someone else might have come along and purchased the domain in order to make a comment about the transience of linked information on the Internet, right? So that is basically the point. You can actually make the case that content drift is the much bigger problem compared to link rot: when you discover a rotten resource you see the 404 — of course that's a bummer, but at least you know it's gone. With content drift, you may not see what you expect to see, and if you do see a page you don't know, right? Is that what you were intended to see or not? Sometimes you don't even remember. So that's another entertaining yet detrimental example of reference rot in scholarly communication.

There's a ton of link rot studies available, one of which is this one: our 2014 paper published in PLOS ONE, where we precisely quantified the notion of link rot and, in a somewhat brute-force, rough manner, estimated — approximated — the notion of content drift. The slides are available online and I'll share them afterwards, so there's no need to rush and write down the URI of this article. What we did in our current study — which literally was published two weeks ago, also in PLOS ONE — is design a method to accurately assess and quantify the notion of content drift and come up with precise numbers for how bad the problem is overall.

The data set we used was basically the same as in the 2014 study. We obtained 3.5 million articles from three different corpora: from arXiv, the physics pre-print archive, from Elsevier, and from PMC, PubMed Central. All of these are articles published somewhere between 1997 and 2012. We had to do some conversion and post-processing in order to extract all URIs referenced from within those articles, and those URIs all point to what we call web-at-large resources — so again, not references to other scholarly articles identified, for example, by their DOI, but really references to your project website, your scientific wiki, data sets, videos, all of the above.
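Just to make that extraction step a little more tangible, here is a purely illustrative sketch — not the pipeline the study actually used; the regular expression, the function name, and the sample text are assumptions of mine:

```python
import re

# Hypothetical, simplified stand-in for the study's conversion and extraction pipeline.
URI_PATTERN = re.compile(r'https?://[^\s<>"\')\]]+')

def extract_web_at_large_uris(article_text):
    """Return HTTP(S) URIs found in the text, skipping DOI-resolver links,
    which identify other scholarly articles rather than web-at-large resources."""
    uris = URI_PATTERN.findall(article_text)
    return [u for u in uris if "doi.org" not in u.lower()]

sample = ("See the project wiki at http://example-project.org/wiki and the "
          "article at https://doi.org/10.1234/example.doi for details.")
print(extract_web_at_large_uris(sample))  # ['http://example-project.org/wiki']
```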
As you can see in the table — in the bottom row — we extracted a bit more than 1 million URIs from these three corpora, so a significant, actually unprecedented, scale of data set for this kind of study to date. And of course — it will become important in a second — we kept track of the publication date of each and every article that references a web-at-large resource.

These three graphs are not meant to be read in detail, so ignore the axes and so forth; they're just my three marketing graphs, because the curves go up, right? The point to realize is that the fine dotted lines are the number of URI references in all three corpora over time, so these graphs are just there to support my point: authors increasingly use URI references to web-at-large resources. Of course it only took off at some point — within the arXiv corpus it became increasingly important from, let's say, 2007 or 2008 on — but the point is that the trend is there. We cannot and do not want to stop that trend, but it is a fact, right? So we need to deal with this sort of problem.

Okay, so how do we go about assessing and precisely quantifying the notion of content drift in our scholarly articles? Imagine a timeline. At some point in time a paper was published, and it references a web-at-large resource by means of its URI — so, a paper that contains a URI. For each of these URIs we then use the Memento framework — I'll talk about this in a second — to obtain archived copies of the referenced URI surrounding the publication date of the article that references it. So if you imagine that the paper was published at time t, we obtain an archived copy of that URI — what we call a memento — from before t, as close as possible to t but prior to it, and we may also obtain a memento, an archived copy of that URI, created after the article was published, at some point after t. So now we really have a time interval surrounding the publication time of the article, and we look for what we call memento pairs: a memento pre and a memento post, one created previous to the publication of the article and one created post publication.

The somewhat unique aspect of this is that, using the Memento framework, we can look for those pre and post mementos in a total of 19 archives around the world. It's no secret that the Internet Archive is the biggest one and the archive that has contributed most of those mementos, no question about it, but there are others. In total we obtained 650,000 memento pairs — so, pairs for 650,000 of our one million URIs. So that's good: we found the mementos.
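As an aside, here is roughly what such a Memento lookup can look like from a client's perspective; this is only a sketch — the aggregator TimeGate URL, the example URI, and the response handling are assumptions of mine, not details from the talk:

```python
import requests

# Illustrative only: a Memento aggregator TimeGate in the style of the Time Travel service.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"

def closest_memento(uri, accept_datetime):
    """Ask a Memento TimeGate (RFC 7089 datetime negotiation) for the archived copy
    of `uri` closest to `accept_datetime` (an HTTP date string)."""
    resp = requests.get(
        TIMEGATE + uri,
        headers={"Accept-Datetime": accept_datetime},
        allow_redirects=True,
        timeout=30,
    )
    # The capture time of the returned memento is reported in its Memento-Datetime header.
    return resp.url, resp.headers.get("Memento-Datetime")

# Paper published on 2009-08-15: look for a capture around that date.
# Building the actual pre/post pair would mean walking the TimeMap for captures
# strictly before and strictly after the publication date.
print(closest_memento("http://example-project.org/wiki", "Sat, 15 Aug 2009 00:00:00 GMT"))
```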
So what do we do now? We need to identify representative mementos, with the notion of: what did the author intend to cite at the time of the publication of the article? In order to do that, we compare the textual content of the two mementos, with the assumption in place that if the memento pre is the same as the memento post — the two encompassing the publication time of the article — then that is the content the author intended to cite at the time he or she wrote and/or published the paper.

Okay, two examples for this. Two screenshots of mementos: pre on the left, post on the right. This URI was cited in a scholarly paper, the one with the URI I put up on top of the slide. The paper was published in August of 2009 — on August 15th, to be precise. The memento pre was captured by the Internet Archive on May 8th of that year, the memento post was captured on August 27th, and just by eyeballing you can see — maybe something went wrong with the capture process, but clearly the content has changed. So that's an example of content drift within about three and a half months; there's an issue, right? It's not all bad, though. Of course, we dug out an example that displays the opposite case: another URI referenced by a scholarly article — an arXiv article, actually — published on July 4th of 1997 (not '77, my apologies). The memento pre was created just roughly a month prior to the publication date, and the memento post I didn't even put up there, because this is still the version that you can dereference today. So the argument here is that this URI reference presumably has not drifted — the content of that URI reference has not drifted a bit in 19 years.

All right, so back to our comparison of the pre and post mementos. Two questions arise now: (a) how do we compare the textual content of these two, and (b) how do we assess representativeness — how similar do they have to be so that we can label them as representative of what the author originally intended to cite? We use four very common text comparison measures: simhash, Jaccard, Sørensen, and cosine. Without going into detail here, there are basically two or three major differences between the four. Simhash is hash-based, so it is sensitive to even the tiniest of editorial changes — a comma here, a fixed typo there; simhash responds to that sort of change and gives you a notion that there was a minor change. Jaccard and Sørensen are both based on sets of characters, so if, say, whole chunks of characters change, Jaccard and Sørensen will trigger and give you a notion that something has changed. And cosine, since it's TF-IDF based, is more of a contextual indicator: hey, the context has really changed, there's a difference in the most salient terms between these two versions. That gives us a notion that you didn't just fix typos in these two versions — the content really has changed, because the most salient terms are not there anymore, as they were before. So it's a good spread of similarity measures that, in aggregate, should give us a good idea of whether the two compared documents are the same or not. That answers the first question: which similarity measures we use and why.
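To make the four measures a bit more concrete, here is a small self-contained sketch; it is not the implementation from the study, and the character n-gram size, the whitespace tokenization, and the hash function are choices I made purely for illustration:

```python
import hashlib
import math
from collections import Counter

def char_ngrams(text, n=3):
    # Character n-grams; n = 3 is an arbitrary choice for this sketch.
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def jaccard(a, b):
    A, B = char_ngrams(a), char_ngrams(b)
    return 100.0 * len(A & B) / len(A | B) if (A | B) else 100.0

def sorensen(a, b):
    A, B = char_ngrams(a), char_ngrams(b)
    return 100.0 * 2 * len(A & B) / (len(A) + len(B)) if (A or B) else 100.0

def cosine(a, b):
    # Plain term-frequency cosine; the study weights terms with TF-IDF over the corpus.
    A, B = Counter(a.split()), Counter(b.split())
    dot = sum(A[t] * B[t] for t in A)
    norm = math.sqrt(sum(v * v for v in A.values())) * math.sqrt(sum(v * v for v in B.values()))
    return 100.0 * dot / norm if norm else 0.0

def simhash(text, bits=64):
    # Weighted bit votes over token hashes: tiny edits flip only a few bits.
    votes = [0] * bits
    for token, weight in Counter(text.split()).items():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def simhash_similarity(a, b, bits=64):
    hamming = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 100.0 * (1 - hamming / bits)

pre = "Welcome to the Digital Libraries 2000 conference programme and registration pages."
post = "Cheap project management software for your whole team!"
print(jaccard(pre, post), sorensen(pre, post), cosine(pre, post), simhash_similarity(pre, post))
```

All four scores are normalized to the 0–100 range, which is the scale the talk uses.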
Now the second question arises: since we normalize the score output of all four measures, when do we know that the two compared documents are actually the same? Our intuition was: well, let's go all out. They are the same as long as all four measures agree that they're the same, meaning they all give the maximum score of 100. If that is the case — if the memento pre and the memento post score a perfect 100 on all four similarity measures — then we call them the same and hence representative of what the author initially intended. But we didn't quite feel comfortable with that alone, so we went for a sanity check. HTTP as a protocol can return particular headers when you dereference a resource, and some of those headers can give you an indication of whether content has changed between two requests for the same resource. So we used in particular the ETag and Last-Modified headers and compared them for the memento pre and the memento post; if they are the same, then we know that HTTP says these two resources are the same. And as a matter of fact, we passed our own sanity check, which is always a good thing, right? The vast majority of memento pre and post pairs that are, according to HTTP, the same indeed score the perfect score on all four of our similarity measures. Hence we were really comfortable with the idea of using that very strict rule of perfect scores across the board.

All right, so back to our comparison: we compare memento pre and memento post, we figure out whether they are the same — whether the text really is the same — and we end up with 313,000 URIs that have a representative memento available in the web archives. Given that we started off with 1 million, we're now down to about 30%; that's the name of the game, right?

Okay. This is good, because now we know with a level of confidence what the author intended to cite at the time of publication. How do we assess content drift? Well, we compare what we call the representative memento with the live version of that URI. That makes sense, because that is what we actually consume when we read the paper — when we follow the reference and dereference the URI that the author put into the paper. But then the other problem, link rot, kicks in again: only 241,000 out of our 313,000 URIs actually still have a live version available, so we can only make that comparison for those 241,000 URIs.

Okay. So what we do next, as mentioned before, is compare the representative memento that we determined earlier with the textual representation of the live version of that URI. We apply the same four similarity measures to assess that similarity, and basically for convenience we binned the results into six bins by similarity score. As I mentioned, the scores are normalized between 0 and 100: the first bin represents values between 0 and 20, then 20 and 40, 40 and 60, and so forth, and the last bin is where the similarity score equals 100, the perfect score. Plotted out it looks like this, and I realize it's a bit small to read, but we'll get there. For greater insight we distinguish between our four similarity measures — simhash on the top left, cosine on the top right, Jaccard and Sørensen on the bottom — and our six bins are represented on the x-axis of each graph.
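Pulling the HTTP sanity check, the strict representativeness rule, and the binning together into one tiny sketch — again a simplification with hypothetical inputs, not the study's code:

```python
def http_says_same(headers_pre, headers_post):
    """Sanity check: if the ETag or Last-Modified header recorded for the pre and
    post memento match, HTTP itself indicates the content did not change."""
    for field in ("ETag", "Last-Modified"):
        if headers_pre.get(field) and headers_pre.get(field) == headers_post.get(field):
            return True
    return False

def is_representative(scores):
    """Strict rule from the talk: the pre/post pair stands in for what the author
    cited only if all four normalized similarity scores are a perfect 100."""
    return all(score == 100 for score in scores.values())

def drift_bin(score):
    """Six bins used in the plots: [0,20), [20,40), [40,60), [60,80), [80,100), and exactly 100."""
    if score == 100:
        return "100"
    lower = int(score // 20) * 20
    return f"[{lower},{lower + 20})"

# Hypothetical values for a single URI reference:
print(is_representative({"simhash": 100, "jaccard": 100, "sorensen": 100, "cosine": 100}))  # True
print(drift_bin(73.4))  # "[60,80)"
```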
The first observation we can make from these graphs is that the pattern is fairly similar across the measures: we see very little dissimilarity. The notion of a perfect score — the similarity between the representative memento and the live version being the highest — is indicated by the rightmost column in every single graph, the lightest shade of blue, with the relative values represented by the line against the right axis. We see, for example, that Sørensen and Jaccard basically put something like 30% at the perfect similarity score, so according to them 30% of those URIs have not drifted at all. Cosine is a bit more restrictive: just a bit more than 20% of URIs have not drifted. And according to simhash, about 30% of resources have not drifted. So that's the picture per similarity measure.

That's a bit confusing, because it's four different scores, so we aggregated all of them into one graph, which looks like this. Again, the pattern is the same, the shape is the same. The rightmost column in this graph represents the fraction of URI references that have not drifted, over all three corpora and over all URIs, and as you can read — and that's the good news of this graph — roughly 23.7% of URIs have not drifted. That's the good news. The bad news, of course, is the flip side of the coin: three quarters of URIs have drifted. How much have they drifted? Well, that's a very tricky question to answer, because they fall somewhere in those lower bins, and what that translates to in terms of how significant the change is — we don't go there, we don't know. But the point is that they have drifted to some extent over time. The measures that we apply are admittedly somewhat crude, and they only compare text, so we don't know whether, for example, an image has changed — those sorts of things. But since we're using a good variety of different textual similarity measures, we can confidently say that three out of four URI references have drifted over time.

This graph is where the acknowledgement to Andy Jackson comes in, because now we're basically visualizing the same sort of data, separated by corpus and also over time. This is the data for arXiv, the physics pre-print corpus. The color pattern is the same: the light blue on top is the portion — the fraction — of URIs that have not drifted, where similarity was 100, and everything that's black is link rot. Again, both together make up the notion of reference rot. You can see, for example, that we have somewhere between 20 and 30% of URIs unchanged — not drifted — for the arXiv physics pre-print corpus, and roughly 10% link rot. But if you go back only to, let's say, 2005, it seems you have roughly 20% link rot and maybe only 10% of content not drifted, which means 90% of URIs are subject to reference rot, either link rot or content drift. That's an alarming number, right? And of course the numbers get worse the further back in time you go: for articles published in 1999, for example, we're looking at a link rot rate of almost 50%, and the rate of URIs that have not been subject to content drift is maybe 5%, maybe 10%. The point is, the numbers get worse over time, and the pattern is basically the same for all three corpora. The reason the PMC corpus on the top right looks a bit more wild, let's say, is that we didn't get too many articles from PMC prior to 2004 or 2005, so those numbers should maybe be taken with a grain of salt. We also see that the numbers for Elsevier are actually worse in terms of link rot
compared to arXiv. In our 2014 paper we tried to analyze why that is by looking at the URIs, and we figured out that articles published with Elsevier more often reference .com URIs than, for example, arXiv papers do, which in turn more often reference .edu domains — domains that are known to be more persistent than the .coms, the .orgs, you name it.

All right, so that was, in a nutshell, the result of our most recent study: precisely quantifying the notion of content drift in our three corpora, generated from arXiv, Elsevier, and PMC. So what can we do about it? We introduce the notion of robust links. Basically, as the name gives away, we're trying to make links — URI references in scholarly articles — more robust. Now you might say: well, isn't that what DOIs are meant to do for us? And we say: not quite. As you know, DOIs were designed to combat link rot for links that point to other scholarly articles, so the design principle is slightly different. In addition, the approach relies fully on the custodians of those DOI-identified resources, and obviously they are strongly motivated to maintain links to their content, because usage of their content is at stake. So there are strong incentives for such custodians — publishers — to maintain DOI references: they will map them to the new location in case the location of the content has changed, and so forth. The custodians of our web-at-large resources, however, as I alluded to earlier, are motivated by completely different factors. They're typically web admins, not scholarly publishers. They don't have such incentives; they don't necessarily care for the longevity of their website, let alone the integrity of the scholarly record. So there's a different motivation behind it. From our point of view, the problem of reference rot is really a problem that is largely rooted outside the scholarly communication community, but we argue that it actually needs to be solved by us, by this very same community.

So what's the current state when we use URIs to reference web-at-large resources? Well, we're guilty as charged, because in our 2014 PLOS ONE paper we referenced URIs like this: we included the original URI to, for example, the Hiberlink project, hiberlink.org, and we kind of took the easy way out and said, oh, this was last accessed on that particular date — and now we're good, right? Because we promise that what we saw on that particular day is really what we meant to reference. Well, neither for link rot nor for content drift does this help in any way, really. But that's the current state. So we are arguing for basically two steps.
The first step is: take the URI that you're referencing and create an archival snapshot — create a memento — proactively. You can do this as an author when you're writing your paper; you can do this as a conference or journal submission system when you receive the paper; you can do this as a publisher when you're publishing the paper; you can do this as an aggregation service, like CORE for example, when you're aggregating articles from someone else. Obviously you want to get as close as possible to the authoring process, but there are several different stages where you could potentially do that step. There are several archives that can do this: the Internet Archive is one institution where you can ask them, hey, please go and grab this URI for me and archive it for me; Perma.cc has been focusing on the notion of archiving references for scholarly articles; WebCite has been around for a while for a very similar purpose; and archive.is is another one of those proactive archiving services that you can ask to archive a particular URI for you. For all of these — in this case all four services — you will immediately get back the URI of the archived resource, the URI of the memento, which you can then use, as we will show.

So, now that we've archived something, have we solved the problem? Well, no, of course we have not. This example is a reference from a Wikipedia page; it points to an archived snapshot in the Internet Archive, and it tells you when it was captured. The problem with that, of course, is that this sort of approach relies on the continuous availability of that particular archive. Of course we hope it's going to stay around, no question about it, but we've seen cases of archives disappearing: I remember mummify.it, which doesn't exist anymore, and WebCite has been struggling with support in the past, and so on and so forth. We also know that the Internet Archive is not accessible in all countries. So we're basically replacing one link rot problem with another if we rely solely on this sort of approach; just replacing the original URI with the URI of the memento does not solve the problem for good.

So the second step that we're arguing for is: decorate your links. When you're writing your paper, your first step, as mentioned, was to create an archival copy of the URI you're referencing. Now decorate that link in your paper with the original URI — don't throw that away — with the URI of the memento, the archival snapshot that you created, and indicate the archival datetime of that archival copy. With these three pieces of information you basically create a really good fallback mechanism: if your original URI doesn't work anymore — if it's subject to link rot — you can always go back to the memento, the archival copy. If that archive is unavailable — access is down, or the archive has stopped operating — you can always use the original URI to query other archives: do you have a copy?
And that, combined with the archival datetime — the third piece of information we ask you to decorate your link with — lets you ask any archive for a temporally appropriate copy of that URI: basically, give me a copy of this URI at time X, where time X is the publication date of the article, or whenever you created the reference. So there are several fallback mechanisms, preparing for the worst case possible: your original doesn't work anymore, that one archive is down, and other archives are hopefully still available where you can look up a copy of your URI.

All right, so it could look like this. This is another reference, actually from the very same Wikipedia article, where you provide the original URI ("archived from the original"), you provide the capture datetime of that URI, and you provide a reference to the archival copy that you created proactively, as an author for example. So again, these three pieces of information are, from our point of view, essential to make links more robust.

What could this look like in practice? A regular link — a little bit of HTML doesn't hurt — to cni.org, the canonical website of CNI, and we can use standard HTML5 attributes to convey the three pieces of information. We still convey the original URI; we use the data-versionurl attribute to convey the URI of the memento, of the snapshot that we created; we use the data-versiondate attribute to convey the datetime at which we created that archival snapshot; and we close our tag. That's the only thing you need to do — as an author, for example — to decorate your link with the necessary information and make it more robust (a small sketch of both steps follows at the end of this passage). We call this approach Robust Links. We wrote a specification document, at the URI on the bottom of that slide, and last year Herbert was fortunate enough, together with Michael Nelson, to publish a paper in which they actually applied this — they talked about it at CNI last fall — and we were able to convince D-Lib Magazine to implement robust links. D-Lib, as you all know, is an online-only publication venue — it's HTML — and we injected a little bit of JavaScript into the page. The result, and I'm not sure if you can see it, is these little anchor symbols right next to each link. You can click on them and get a little pop-up in the center of the screen that gives you three options for that link: you can go to the archival snapshot of the link; you can ask the Memento framework to do some magic for you and go to the link in any archive, given the link date, the archival datetime; or — another thing you can actually do — you can go to the link as it was at the creation date of that page. So again, the point is: link decoration is easy, it's based on standards — we use HTML5 attributes, that's not rocket science, it's browser-supported, everything is good there — and this is one approach, with a little bit of JavaScript, to make those links actionable and way more robust than before.
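To make the two steps concrete, here is a small sketch of "archive proactively, then decorate the link". The Save-Page-Now-style endpoint and its response handling, as well as the helper names, are assumptions of this example; only the data-versionurl and data-versiondate attributes come from the Robust Links approach described above:

```python
import requests

def archive_now(original_uri):
    """Step 1 (sketch): ask a web archive to capture the URI right now.
    The Internet Archive offers a 'Save Page Now' style endpoint; the exact URL
    and the assumption that the final response URL is the memento URI are mine."""
    resp = requests.get("https://web.archive.org/save/" + original_uri, timeout=120)
    return resp.url

def robust_link(original_uri, memento_uri, version_date, link_text):
    """Step 2: decorate the link with all three pieces of information using the
    HTML5 data-versionurl / data-versiondate attributes described in the talk."""
    return (f'<a href="{original_uri}" '
            f'data-versionurl="{memento_uri}" '
            f'data-versiondate="{version_date}">{link_text}</a>')

# memento = archive_now("http://www.cni.org/")   # would trigger a live capture
memento = "https://web.archive.org/web/20160401000000/http://www.cni.org/"  # placeholder
print(robust_link("http://www.cni.org/", memento, "2016-04-01", "CNI"))
# <a href="http://www.cni.org/" data-versionurl="https://web.archive.org/web/..."
#    data-versiondate="2016-04-01">CNI</a>
```

Keeping the original URI in href, rather than swapping it for the memento, is the design point: the reader still lands on the live web by default, and the decoration only comes into play when a fallback is needed.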
All right, I realize I flew through this a bit quicker than I intended, but I basically have four take-away messages for you. The first one: as we've seen, scholarly articles increasingly contain URI references to web-at-large resources. We've all been there — when writing scholarly articles, we all use URI references to resources that we find on the web and that we intend to use to convey our message, to support our point, for example. So this is a significant part of scholarly communication, and the trend is increasing: more and more authors use more and more URI references to web-at-large resources. That's the first thing. The second observation is that these resources are subject to reference rot. They're no different from any other web resource — they're just regular web resources — hence they're subject to the same dynamic character of the web. So we see link rot and we observe content drift; there's nothing special here, and just because you reference a URI in your scholarly article doesn't make the resource any more robust per se. So we need to do something about that. The third message is that the custodians of these resources are typically not overly concerned about long-term preservation, long-term access, and the integrity and fixity of those resources. And hence, the fourth: we need to do something about this, and we can — as publishers, as authors, as third parties (I mentioned CORE, for example). We can take action to proactively archive our referenced resources, because we certainly think the integrity of our scholarly record is worth it, and the price is not all that great.

And with that I'll stop — I intentionally left some time for questions. Thank you so much for listening, and I'm happy to take your questions.