Thanks for coming to this session. I'm presenting today with Martin Klein and Michael Nelson, and this is actually the first presentation that we're doing about a Mellon project that we are involved in, which we have started to refer to as the Scholarly Orphans project. This is the outline of the presentation: I'll start with the problem statement and the project perspective, and then Michael and Martin will look into aspects of, well, explorations that we're currently doing in search of a solution to the problem.

I'll start with the problem statement itself. The context of the project is the rapidly changing nature of scholarship online, with the consideration being that increasingly the research process, and not only the outcomes of research, is online. So we see researchers basically dropping artifacts all around the web in various kinds of productivity portals, you could say: an extension of the scholarly record with a wide variety of new artifacts. What is interesting in all of this is that these artifacts are many times actually dropped in portals that are not necessarily dedicated to scholarship. I'm talking about things like GitHub or SlideShare and so on. So they're all over the place. Some of these portals are for scholarship specifically, but others are more general and are also used by scholars. I'd like to mention that people in the Netherlands have actually started to create a registry of these kinds of tools; it goes under the name of Innovations in Scholarly Communication. It basically lists all these new kinds of portals that researchers are using as they go about their business, and they have categorized them into writing, analysis, discovery, assessment, outreach, et cetera.

A couple of CNIs ago we had a presentation from OCLC on this notion of the evolving scholarly record, so this is also part of the context that we're looking at. Here OCLC was basically considering: what really is the scholarly record, and how do you delineate it? They also came at it from the perspective of archiving this new scholarly record, and asked whose task it actually is to archive all those new types of materials. I've done quite some thinking in that realm myself over the past years. For example, I did a paper for iPres with Andrew Treloar, who is at the Australian National Data Service, where again we consider the fact that a lot of common web platforms are used for scholarship, like GitHub, wikis, and WordPress. There are reasons researchers are using these, because these portals have attractive features like versioning, timestamping, social embedding, and so on. But then in this paper we basically say: be careful, these platforms are actually recording scholarship, they are not archiving it. To illustrate that, it suffices to look at the terms, or part of the terms, of GitHub, and I'll read: GitHub reserves the right, at any time and from time to time, to modify or discontinue, temporarily or permanently, the service, or any part of it, without prior notice. This is not an archival service, that's obvious, right? To make another point: you probably all remember Google Code, which at one point was really the repository for software collaboration. Well, since 2015 it's gone. Everyone uses GitHub now, and what will be next? The point is that longevity is not necessarily part of the business model of these kinds of portals.
So Andrew and I have kind of categorized what we see as the difference between, on the one hand, the recording that is being done in these productivity portals, and archiving on the other. Recording is short term, no guarantees are provided for the long term, it's write many read many, and it's about the scholarly process. Archiving for the scholarly record, on the other hand, is obviously longer term, there are at least attempts to provide guarantees of longevity, it's write once read many, and once you've archived something it actually becomes part of the scholarly record.

Throughout the presentation Martin and Michael will use two colleagues of ours as examples of people that are actually using these productivity portals online. One is Ian Milligan, a historian who uses a lot of web archiving technology, from the University of Waterloo, and the other is Mark Matienzo from Stanford University. When you look up these people you see that they indeed have various web identities, as we call them. Ian has a homepage, uses SlideShare, uses GitHub, and obviously he leaves scholarly traces in all these environments. Mark also has a homepage, also SlideShare, also GitHub; he also uses the Open Science Framework and actually contributes code to the Drupal repository. So again, those are all traces that they are leaving around the web.

The problem, as I mentioned, is that these platforms are recording and not archiving. Of course we have web archiving activities going on, but, and this is of course only anecdotal evidence, we look here at one of Ian's SlideShare artifacts and we use the Time Travel service, which looks across all public web archives, and basically this thing is archived nowhere; it's in no web archive around the world. The situation is a little bit better for this GitHub repository of Ian's, where we actually find exactly one copy of the repository in the Internet Archive. For those of you that are familiar with the Internet Archive, the typical pattern is that you have a lot of bars here, meaning stuff is archived a lot. As I said, this is anecdotal, but from the Hiberlink project we actually know that what we call web-at-large resources are generally speaking very poorly web archived. So that's really the problem that is at the basis of this project.

It's a Mellon-funded project, the Scholarly Orphans project, a collaboration between Los Alamos and Old Dominion University. The problem that we're really looking at is how we could capture these artifacts that are left in all these different portals so that they can be archived and preserved. We take a paradigm that's inspired by web archiving, one because of the scale of the problem, and also because striking bilateral agreements with each of these portals in order to allow back-office archiving is probably not realistic. And we also explore a paradigm that's institution driven, in the sense that we consider it the task of the institution to look after its own researchers, to figure out where on the web they are and where they leave traces, and then to go after those. That doesn't necessarily mean that institutions have to do that themselves; a subscription service could also do it on their behalf. So this is a very high level picture of how we're looking at the problem domain: there's an institution with several researchers, these researchers have identities in these productivity portals on the web, and in each of these portals they are creating artifacts, they are leaving their traces.
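As a quick aside on the Time Travel lookup mentioned a moment ago: a rough sketch of how such a check could be scripted against the Memento Time Travel aggregator is below. The endpoint pattern and JSON layout reflect my understanding of that public API and should be verified; the example URI is just a placeholder, not one of the artifacts discussed.

    # Rough sketch: ask the Memento Time Travel aggregator whether any public
    # web archive holds a capture of a URI. Endpoint pattern and JSON layout are
    # assumptions about the public API; treat as illustrative, not project tooling.
    import requests

    def is_archived(uri, near="20180101"):
        api = f"http://timetravel.mementoweb.org/api/json/{near}/{uri}"
        resp = requests.get(api, timeout=30)
        if resp.status_code == 404:
            return False  # the aggregator knows of no memento in any public archive
        resp.raise_for_status()
        mementos = resp.json().get("mementos", {})
        return bool(mementos.get("list"))

    print(is_archived("https://www.slideshare.net/example-presentation"))  # placeholder URI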
So basically, from the perspective of our project, these become candidates to go capture, in order to then be able to archive them. You will undoubtedly recognize similarities with some other projects out there. Clearly LOCKSS, which also takes a web-archival, crawling kind of approach, but focuses on the journal literature; we explicitly are not looking at that material because other people are looking into it. Archive-It, which is the on-demand service of the Internet Archive: again subscription-based grabbing of materials, but clearly not focused on the scholarly orphans that we are after. To an extent institutional repositories, right, but with the difference that in institutional repositories scholars are actually supposed to upload their stuff themselves; that's not what we are about, we are about automatically trying to grab it, and of course institutional repositories again focus on the journal literature, not necessarily on these materials that we're discussing. The closest similarity is maybe this thing I came across called the Locker Project. It doesn't exist anymore, actually; it was active about five, six years ago. This was a project to capture the web presence of individuals by basically interacting with and scraping different portals around the web. It didn't have a scholarly focus; it was more about convenience: you leave your pictures in Flickr and your slides over there, et cetera, and let's now bring them all into one environment under the control of the user. That was the Locker Project.

So this is the flow that we are currently exploring to try to look into this problem, and again Martin and Michael will go into details, so I'll only explain it at a very high level. There are four steps in the chain that we are considering currently. The first is discovering the web identities of our scholars: from an institutional perspective, how do you figure out what all the identities of your people are in these portals out there? The second step, once you have discovered these identities, is to go find the artifacts that are attached to those identities. The third: typically when you put these things in those portals it's not just one URI you arrive at, because many of these things have landing pages and underneath there are other resources that relate to the object. So there's this notion of, once you have found an artifact, what is the boundary, how many URIs do you really need to capture in order to get the essence of the entire artifact? And then the last one is of course to actually go and capture the materials for each of the URIs that sit under a certain artifact. And with that I'm going to leave it to Martin.

Thank you, Herbert. So I'll start by looking into step one of our capture flow. There in particular we're exploring two approaches. We explored an algorithmic approach in the past, and now we're focusing on the exploration of a web identity registry. As a pointer to previous work: the algorithmic approach of identifying web identities was published in the Code4Lib Journal. We called the system EgoSystem; it basically took a list of postdoc scholars from Los Alamos National Laboratory plus some additional information, for example the institution they graduated from, fed that information into APIs, like search engine APIs, LinkedIn, those sorts of things, to discover what their web identities would be, and built a nice graph around it, and so on and so forth. I highly recommend the Code4Lib paper, where we detailed the implementation of the entire system.
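Before going further, the four-step chain Herbert outlined could be pictured, purely illustratively, like the sketch below. Every function and type name here is a hypothetical placeholder that just mirrors the steps as described; it is not project code.

    # Purely illustrative sketch of the four-step capture chain described above.
    # All names and types are hypothetical placeholders, not project code.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Identity:
        researcher: str   # e.g. an institutional identifier
        portal: str       # e.g. "github", "slideshare"
        uri: str          # the web identity URI in that portal

    def discover_identities(researcher: str) -> List[Identity]:
        """Step 1: find the researcher's web identities (registry- or algorithm-based)."""
        ...

    def discover_artifacts(identity: Identity) -> List[str]:
        """Step 2: find artifact landing-page URIs attached to an identity."""
        ...

    def determine_boundary(artifact_uri: str) -> List[str]:
        """Step 3: expand a landing page into the set of URIs that make up the artifact."""
        ...

    def capture(uris: List[str]) -> None:
        """Step 4: capture each URI so the artifact can be archived."""
        ...

    def run(researcher: str) -> None:
        for identity in discover_identities(researcher):
            for artifact in discover_artifacts(identity):
                capture(determine_boundary(artifact))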
All right, so our current approach focuses on registries, and the first registry that may come to mind for trying to identify web identities is ORCID. Again we take our two sample researchers here, Ian and Mark; both of them have an ORCID profile. If you look at, for example, Ian's ORCID profile, and this might be hard to read in the back, you'll see a couple of information points about his education, his employment history, some funding information, and some works, mostly published papers, but we see zero web identities. You will notice that the left hand side, where they would usually occur, as you'll see in a second, is blank, it's white. So there are no further references to identities of Ian in other scholarly portals. If you look at Mark's ORCID page, on the other hand, we see similar information about education, employment, and so on and so forth, but on the left hand side you'll notice three web identities that Mark left in his ORCID profile: one is a reference to his professional web page, his own web presence, and two are references to Scopus and to ResearcherID, where you could potentially find further publications by Mark. If we now take the reference to his personal web page and follow it, we land, of course, on his site, and there, by browsing around a little bit, we are able to discover two further URIs, two web identities of Mark, namely his Twitter account and a reference to his GitHub account. So this is a very manual approach; the point in this context is to show that web identities could potentially be discovered in this way.

I'll briefly talk about an experiment that we conducted in the past and have written up in a paper that will be published later this year at JCDL, and a pre-print is available, where we really evaluate ORCID records for the sake of discovering web identities. The question that immediately comes to mind is: how well do ORCID records represent the community of researchers at large? We approach this question from three different angles: we look at the adoption rate of ORCID records, we look at the subject coverage of ORCID records, as well as the geolocation coverage of ORCID records. And then, last but not least, we try to answer the question: how well do ORCID records actually do for the discovery of these web identities?
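To give a feel for how such a registry lookup could be automated, here is a hedged sketch against ORCID's public API. The endpoint and JSON field names reflect my understanding of the v3.0 public API and should be double-checked, and the ORCID iD used is just a placeholder; this is not the script used in the experiment described here.

    # Hedged sketch: pull the "researcher URLs" (web identities) from an ORCID record
    # via the public ORCID API. Endpoint and field names are assumptions about the
    # v3.0 public API; verify before relying on them.
    import requests

    def orcid_web_identities(orcid_id):
        url = f"https://pub.orcid.org/v3.0/{orcid_id}/researcher-urls"
        resp = requests.get(url, headers={"Accept": "application/json"}, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        # Each entry carries a free-text label (the labeling problem discussed below)
        # and the actual URL value.
        return [(item["url-name"], item["url"]["value"])
                for item in data.get("researcher-url", [])]

    print(orcid_web_identities("0000-0000-0000-0000"))  # placeholder ORCID iD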
So it's not a secret that ORCID has seen increasing uptake among researchers. The dark blue bars, drawn over time from the four ORCID data dumps that you can get, are increasing, right; in 2016 we're at about two and a half million ORCID records, and you have probably seen that they just passed 3.1 million records since then. So that's a nice little trend. You'll see in dark red the fraction of ORCID records that actually have works information, so your publications for example; the lighter red shade represents the fraction of ORCID records that have affiliation information attached; and in orange the number of ORCID records that actually hold web identities, such as the URI of the personal web page of Mark that we saw a couple of slides ago. You'll see that the fraction is, at first glance, not great: in 2016 we have 26% of ORCID records that hold affiliation information, 20% that hold works, publication information, and only just above 6% that hold information about further web identities.

I mentioned that we're addressing the coverage notion from three different angles. The first angle is subject coverage: how well do ORCID records, and the information left in ORCID records, represent the scholarly landscape at large? For the details of this graph I refer to the paper; however, you'll see ORCID records in blue and overall publications in the US in red, and the first thing that comes to mind observing this graph is that the subjects are dominated by the life sciences. That may or may not be surprising to you, but subjects such as the humanities are almost not covered at all in ORCID profiles. The other aspect you can observe from this graph is that if you compare the subjects of ORCID records to the subjects of PhD researchers that have graduated recently in the US, you'll see for example that in the field of engineering there are many more PhD graduates than there are publications left in ORCID records. So that seems to point to underrepresentation in ORCID there. The point here is that the subject coverage seems to be okay, but it's fairly focused towards the life sciences, and other subjects may be underrepresented.

All right, how about the geolocation coverage? These are the top 10 affiliations from ORCID records, the most recent affiliation in each record, compared, on the right hand side, with the distribution of researchers worldwide as reported by UNESCO, and two things are important to note there. The first is that the fraction of US ORCID records, if you will, compared to the fraction of US based researchers, is pretty good: roughly 17% in both cases, for ORCID and for the real world. The other thing that is obvious is that researchers with ORCID iDs from China, from Chinese institutions, seem to be underrepresented in ORCID if you compare that to the almost 20% worldwide, and that holds true for Japan and for Russia and for Germany and so on. So there's a little bit of an imbalance there.

So, in terms of web identities, the references to home pages for example, what do we find in all the ORCID records that we looked at? The problem here, as you'll notice, is that the labels for references to other web identities are free text. You, as an ORCID researcher, can type in whatever you'd like as the label for your URI, which then automatically results in, you know, "LinkedIn"
and "LinkedIn profile" showing up as two different categories, even though arguably they should be the same, right. The point is that, with fairly low percentages, things like references to LinkedIn, ResearchGate, Academia.edu, Google Scholar, and personal web pages seem to be the dominating fraction of web identities that we find in ORCID records.

All right, a brief summary in between of all these findings. The adoption rate of ORCID, as you've seen in the bar graph, is increasing, which is a good sign; that's a good direction and we're happy about that. The subject coverage, as I've mentioned, seems to be fairly focused towards the life sciences; disciplines like the humanities and engineering seem to be somewhat underrepresented. The geolocation coverage, as you've seen, looks good as long as you look at the US only; on a global, worldwide scale it's not quite representative. And the web identity coverage seems to be fairly poor, and as such, taking ORCID alone as a registry to identify web identities may not quite be workable; not yet, basically.

Okay, step two: once we have discovered our web identities, how do we discover the artifacts that belong to those web identities? There are basically three approaches that we're contemplating. One is an algorithmic approach: going back to Mark's personal web page, we could find his page listing all his presentations, and from there, there are slides linked, there are audio files and video files for presentations linked. So that's an algorithmic approach: scraping, in a smart manner, potentially automatically, those artifacts from the web page. The second approach is something that SlideShare offers: a notification mechanism. You can register with SlideShare and ask the service to notify you whenever a researcher of interest has uploaded a new presentation, for example, and the service will actively notify you. That's the second approach. And the third one is again using a registry, an artifact registry such as ORCID, to get at those artifacts. We see the ORCID page of Mark Matienzo again: not only does he have a total of 12 works listed in this profile, five of those are actually artifacts of interest for our use case, because they are, you know, standards documents, reports, book reviews, things that don't necessarily have a DOI, things that are not necessarily covered by LOCKSS and CLOCKSS and the like. Okay, so these three approaches are what we're considering for the discovery of artifacts.

Once we have discovered the artifacts, how do we, as Herbert mentioned, determine the boundary of those? How do we know where the artifact of interest ends and where the cat video starts, kind of deal, right? There are also two approaches we're considering here, and our focus is on the Signposting approach. That's the nice logo, and signposting.org is the website that I encourage you to go to. It is an approach to make the scholarly web more friendly to machines. How do we do this? We're proposing to do it with HTTP links and registered link relation types. If that doesn't mean anything to you, let me explain with an example. Imagine two resources: resource one is identified by URI-1 and resource two is identified by URI-2. If resource one, let's say, is a metadata record that describes a scholarly article identified by URI-2, you can link the two with a registered link relation type: one describes two, right. If you then send an HTTP request to URI-1, you will see in your HTTP response a Link header that says: I describe URI-2.
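As a concrete illustration of what such a response could look like, here is a hedged sketch. The URIs are made up; "describes" and "describedby" are the registered link relation types that Signposting builds on.

    # Hedged sketch of the Signposting pattern just described. The URIs are made up.
    #
    # A HEAD request to the metadata record (URI-1) might return a header like:
    #   Link: <https://example.org/article/123>; rel="describes"
    # and the article (URI-2) can point back with:
    #   Link: <https://example.org/metadata/123>; rel="describedby"
    import requests

    resp = requests.head("https://example.org/metadata/123", timeout=30)
    link_header = resp.headers.get("Link")
    if link_header:
        for link in requests.utils.parse_header_links(link_header):
            if link.get("rel") == "describes":
                print("This record describes:", link["url"])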
So that's a fairly simple, easy to implement approach to convey relationships between these resources, an approach that a machine can understand and interpret. If you apply this paradigm to particular patterns, certain patterns that repeat in the world of scholarly communication, we can come up with an example like this, where we're trying to address the boundary problem. Imagine you browse to a landing page, and this landing page has several outgoing links to several different resources. How does a machine now know which resources are relevant to that particular landing page you arrived at? Well, the landing page can communicate that by providing links to the relevant resources, let's say the publication in HTML format, the publication in its PDF format, and some supplemental information, and by doing so say: all these resources are relevant to me, they belong to me, they are my items, right. And the other way around works as well: those individual publication resources can link back to the landing page and say: I belong to the collection of that landing page. So there's a bidirectional aspect to this. This of course is a fairly simple example; you can imagine much more complex ones. I won't go into detail here, but think of the notion of a DOI redirecting to a landing page, and of describing where the metadata records are that describe the thing identified by the DOI, and the other way around as well.

Signposting has been motivated by the use case of LOCKSS trying to find, for machines, the relevant resources that are to be archived, so it is somewhat motivated and inspired by the preservation of journals, in that realm. However, there's no reason why this could not be applied to the scholarly portals that we talked about previously; technically there's absolutely no reason. The question of course arises: how do you motivate these portals to adopt these technologies? We do see early uptake: the University College Dublin library has adopted this sort of approach, DataCite has adopted it, and so on and so forth. So there is some early success there, which is encouraging. And with this I think I'll hand it over to Michael, who will detail step four in our process.

Thanks, Martin. All right, can we use the clicker here? It didn't work for me. Didn't work? Okay, all right, no clicker. All right, so some of the challenges for capturing these web artifacts. We really have sort of two main challenges. The first one is the legal challenge: what do we do with it? And the technical challenge: how well do our tools work, can we verify at scale how well these tools work, and can we verify authenticity? We fully expect that we can address the second issue; the problem of course is with the first issue, which, as you might expect, is a real mess. When you look at some of the popular tools that people are using and look at their robots.txt, it's sort of a mishmash of things. SlideShare and GitHub have a very complicated robots.txt; some of these things can be preserved, some of them can't. Drupal seems to do mostly the right thing, but on the other hand, in this case, they have a license telling you: don't do archiving. Obviously we ignore that, but you know, I'm not a lawyer, so probably don't listen to me. This one, the Open Science Framework, doesn't want you to crawl inside; it's not really clear what you can do with this content.
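As a small aside on checking what a portal's robots.txt allows, here is a minimal sketch using Python's standard robotparser. The user agent string and the URL are placeholders, and of course robots.txt only speaks to crawling etiquette, not to the terms-of-service and licensing questions just mentioned.

    # Minimal sketch: check whether a portal's robots.txt allows fetching a given URL.
    # The URL and user agent are placeholders; robots.txt covers crawling rules only,
    # not the licensing questions discussed above.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.slideshare.net/robots.txt")
    rp.read()
    print(rp.can_fetch("ExampleArchivingBot", "https://www.slideshare.net/some/presentation"))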
All right, but assuming we put that aside, let's look at how some of these artifacts appear in popular web archiving tools. Right here we have a SlideShare presentation of Mark's and how it looks on the live web; then here it is in Webrecorder, and here it is in the Internet Archive. Quickly, you can't really tell the difference, so the idea is that in this example all the tools do a pretty good job. Then we go to GitHub, and I think this is from Mark's GitHub: here's the live version; Webrecorder does a pretty good job, you can't really tell the difference here; but clearly something's going on with the Internet Archive. It turns out a style sheet is missing, and that rearranges all of the content. Then we go to the Open Science Framework, and here's the live web version, this is Mark's identity here, and Webrecorder does a bad job and the Internet Archive does a bad job, but they do different bad jobs, right, so you have that going for you. So right now we look at this and say, okay, these are all banged up, they didn't really do a good job. Now we want to automate this, so we don't have to look at it ourselves and say it's banged up.

So one of the things we're doing in this project is continuing some research we started with Justin Brunelle that was first published at JCDL a couple of years ago, and the idea is: how can we quantifiably measure, at scale, how well pages are archived? We don't just want to say, okay, we've got nine out of ten embedded resources, so it's 90 percent, because not all of the things that are missing are going to be of equal weight. If you have a YouTube video and the video is missing, that's one resource, and you got all the little GIFs and whatever, but clearly the main thing, from a human's perception, was not well archived. So we did some preliminary experiments: we manually damaged some pages, we showed them to Mechanical Turk workers, we had them rank them from most to least damaged, and so forth, and we came up with some heuristics for how to weight things that are really missing. Of course the problem is, if things are really missing, you don't know how big they were, right? You go back to 2005, you replay a memento, it's missing something, but it's not clear what it was actually doing in the page.

One of the unexpected results is that style sheets, from a user's perspective, are actually super important. From our perspective we said, well, the content's all there, it's just rearranged funny, but it's there, you can deal with it. The users' perspective was: no, it's damaged, it looks ugly, I don't like it. So we had two tracks: we could try to educate all the people in the world about that, or we could adjust our weights and say that a missing style sheet is kind of important. So we actually came up with this trick. The idea here is we have the local Norfolk newspaper, and on the left hand side is how it appears on the live web, and here's a memento where it's missing a style sheet. Again, all the text is there, but when the style sheet is missing, we divide the page into thirds, and the idea is that normally you have non-background colors more or less equally distributed across these three columns. So here it is on the live web: 33 percent, 26, 29 percent; that's normal page design. If the style sheet is missing, then we actually get scenarios like this, where everything gets shifted to the left, because there's no style sheet moving things out to the right. So basically, if you have more than 75 percent of your non-background colors in the left two thirds and nothing over here, then we consider that page damaged. Now, if you just have an ugly page and no style sheet is missing, then you have an ugly page and we don't penalize you for that. So that's one of the tricks that we do.
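To make that two-thirds heuristic concrete, here is a rough sketch of such a check on a page screenshot. This is my own illustrative reconstruction of the heuristic as just described, not the actual Memento Damage code; the "non-background" test is simplified to "differs from the most common pixel color", and the screenshot filename is a placeholder.

    # Rough, illustrative reconstruction of the style-sheet heuristic described above:
    # flag a page as likely damaged if more than 75% of its non-background pixels
    # sit in the left two thirds of the screenshot. Not the Memento Damage code.
    from collections import Counter
    from PIL import Image

    def left_heavy(screenshot_path, threshold=0.75):
        img = Image.open(screenshot_path).convert("RGB")
        width, _ = img.size
        pixels = list(img.getdata())
        background = Counter(pixels).most_common(1)[0][0]   # assume most common color is background
        boundary = width * 2 // 3
        left = total = 0
        for i, px in enumerate(pixels):
            if px == background:
                continue
            total += 1
            if i % width < boundary:                        # pixel falls in the left two thirds
                left += 1
        return total > 0 and left / total > threshold

    print(left_heavy("memento_screenshot.png"))             # placeholder screenshot file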
Now, in Justin's work, what we ran into was that we had a test library: it worked okay if you knew exactly what you were doing. But now we have a service that's nearly ready for prime time. We have some URLs here, but maybe don't tweet them around too much, because we're still tweaking it. It's almost ready, and the idea is there are three ways to interact with it. One is an interactive service, to make it friendly to plug things into: you put in the URL of a live web page or, more interestingly, a memento, and it goes through, applies all the techniques that we have, and gives you a final damage score from zero to one. That works well for onesy-twosy kinds of things, but there's also a Python library and a Docker image for you to download if you're going to run it over a hundred thousand mementos.

All right, so let's look at some of these examples right here. The first thing we have is a memento at the Internet Archive of our departmental home page, and you can't really see it, but there's a little tiny image that's missing, and, you know, who cares, right? So the damage is very small, 0.06, not a lot of damage. Now, what are the units on there? They're damage units; it doesn't matter, right, it's for ranking mementos against each other. In this next case, I forget what the web page is, but it's also missing a single image; this one is large, though, and it's centered in the viewport, so it was probably more important. Maybe it was just a logo, we don't really know, but the damage increases more significantly, because intuitively this image is more important to the presentation of that page. Then we look at the third example, and you have all kinds of problems: basically all these images are missing, they're big, they're important, and so the weights add up and the damage is adjusted upward. We still have all the text, but maybe those images were saying something important.

All right, so when we apply this technique to Ian's GitHub memento: this is what it looked like on the live web, and this is what it looks like in the Internet Archive. We plug it into the tool and we actually get a small amount of damage, because, with just this gray bar and some of this stuff hanging out here, I think it just barely misses that 75 percent in the left two-thirds range. So in this case it squeaks by with not a lot of damage, even though it's not exceptionally pretty; but on the other hand, the native GitHub interface is not exceptionally pretty either, so I don't know.

All right, so that was about verifying the quality of archiving at scale; again, almost ready for prime time. The next thing is about verifying the authenticity of what we have in the archive. The issue is that when there's only one archive, we implicitly trust Brewster and the Internet Archive; unless he's running a really long con, we don't have to worry about him going in and hacking pages. But as we move to an environment where there are going to be hundreds if not thousands of archives, we really have to worry: are you going to trust the Breitbart archive that pops up at some point, right? So, just hypothetically, say you had a nasa.gov web page that talked about parts per million of carbon dioxide.
In ten years, of course, this page will surely go away, right? And then at some point something shows up that is labeled Michael's Evil Wayback. I'm presenting this to you as a principled scholar, but I'm a gun for hire; I will work on the Breitbart archive, I'll knock it out. So I'll go in and I'll edit this, and it says 275 parts per million, don't worry, everything's fine, there's no such thing as climate change. But five years from now the nasa.gov page itself is unavailable, so how are you going to resolve who has what observation, right? So again, fake archives are an issue that we're going to have to deal with in the future. Right now we're in this luxury situation where not many people are running archives; eventually there's going to be economic incentive to run archives, and then we're going to have a huge problem.

All right, so here is our approach to that, and this is really early, we're just working out a proof of concept and so forth. In this case we're taking a PDF to keep it really simple. This is our original resource, and we push copies of it into all the public archives that we can think of. Right now it goes to the Internet Archive, archive.today, WebCite; you could push it into Perma.cc and all the other ones that are available. Then we compute fixity information on this. Now, in the future it would be nice if the Wayback Machine computed fixity as it ingests things, but for the moment we're essentially replaying it immediately and observing the fixity information that comes back. Then we make a manifest of this information, which has the specific URI and how you replayed it, so that you can get back the original content and not the processed content, the date of the observation, a whole list of hash values that you got, and, not shown here, a list of how you computed those hashes, because there are a million ways you can combine that. So now we have a manifest file; we put it on a server and publish it on the web, and you feel what's coming next, right? We take this manifest file and we push that too into all the different archives that exist. So now we essentially have the fixity information pushed into n different public web archives.

All right, so now we've got a bunch of copies of things making assertions about each other: we have the original resource, we have a number of mementos of that resource, we have a manifest that says I observed that this memento, at this time, returned this fixity information, and we make lots of copies of that manifest file as well, and the idea is that hopefully some of these are going to survive for the long term. So how do we authenticate all that? The idea is you come across a single memento, maybe with a browser add-on, or you use a Python library to do this at scale, or whatever. Given a memento, we have a well-known place where you can look up the manifest file for that memento; then we discover all the mementos listed in that manifest; then we verify the integrity of the manifest file itself using something called trusty URIs, and you can read more about how we're doing that here; and then from that we go through and recompute the fixity information to discover whether or not that page said, at some point in the past, what it says now. If it did, then we can take a vote. Now, whether you can believe that these are independent archives is a whole separate issue, right, because mementos and archives are not necessarily the same thing.
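To make the manifest side of this a bit more tangible, here is a small hypothetical sketch of building one fixity observation. All field names, URIs, and simplifications here are my own placeholders, not the project's actual manifest format or tooling.

    # Hypothetical sketch of one fixity-manifest entry, in the spirit described above.
    # Field names and URIs are placeholders; this is not the project's actual format.
    import hashlib
    import json
    from datetime import datetime, timezone

    import requests

    def fixity_entry(urim):
        """Replay a memento URI (URI-M) and record fixity information about it."""
        body = requests.get(urim, timeout=60).content   # simplification: raw replayed bytes
        return {
            "uri-m": urim,
            "observation-datetime": datetime.now(timezone.utc).isoformat(),
            "hashes": {
                "md5": hashlib.md5(body).hexdigest(),
                "sha256": hashlib.sha256(body).hexdigest(),
            },
            # The approach as described also records *how* the hashes were computed
            # (e.g. which transformations were stripped to get back original content).
        }

    manifest = {
        "original-uri": "https://www.nasa.gov/some/page",   # placeholder URI-R
        "observations": [
            fixity_entry("https://web.archive.org/web/2018/https://www.nasa.gov/some/page"),
        ],
    }
    print(json.dumps(manifest, indent=2))
    # This manifest would then itself be published on the web and pushed into several
    # archives, so that later verification can recompute and compare the hashes.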
Then we just do a majority vote. We have a video, which we can tweet out later, that actually walks through this scenario and finds that in Michael's evil archive the nasa.gov page has been tampered with, while the other mementos have not. All right, I think that's the end of our presentation; at this point we'll take your questions.