Okay, I think it's time to get started. Welcome everyone. I'm Cliff Lynch, the director of CNI, and you have found your way to the CNI virtual Spring 2020 meeting. We're now moving toward the end of the second week of the meeting. Today we have a project briefing on the persistence of persistent identifiers on the scholarly web. This is work that Martin Klein has been doing with his team at Los Alamos National Laboratory, and it's part of a series of studies he has been doing around the properties of the scholarly web, identifiers, and the stability of these things over time. Martin will be talking and will take questions at the end. You can put in questions at any time using either the chat or the Q&A tool at the bottom of your screen, and when he's done, Diane Goldenberg-Hart from CNI will be moderating the Q&A part of the session. So with that, Martin, welcome, thank you for doing this, and over to you.

Thank you very much, Cliff, for this kind introduction, and thank you to CNI for making this happen. Thank you all for being here; I really appreciate you joining me in this session today. As Cliff mentioned, my name is Martin Klein and I work in the Research Library at Los Alamos National Laboratory. What I would like to do in this session is give you a brief overview of our series of experiments on the persistence of persistent identifiers on the scholarly web. In particular, we're testing Digital Object Identifiers, or DOIs, and how consistently or inconsistently scholarly publishers respond when DOIs are requested.

Before I start, I'd like to refer you to the preprint that my colleague Luda and I recently published for more background on the topic, more details on the dataset that we generated and the methodology we applied, and also for more results from these experiments than I am able to show today. So please feel free to check out the preprint for further information.

To motivate our work, I would like you to imagine three different scenarios. In the first one, you find yourself in a park like the one displayed here. Unfortunately, you fell and broke your leg, so you find yourself in an emergency situation. However, you're lucky in a way: you have your iPhone on you, you're not out of battery, and you're within cell phone reception. So you can actually call an emergency service, 911 in this case, and you do that. This friendly fellow answers your call and responds to your request in a very professional manner, so that you eventually receive the proper help that you need. Given the circumstances, that's a great outcome for that scenario.

Secondly, please join me in imagining that you're in your office, a different environment. You again tripped over a wire, let's say, and unfortunately broke your leg, but you can make it to your office phone, which looks like this. You're again able to call the emergency hotline, 911, to ask for help. This time around, this guy responds, and that's obviously a different sort of response, certainly less professional. As a result, some sort of help is initiated, though probably not the help that you need; what are you supposed to do with that when you have a broken leg?

For the third scenario, please imagine you're at home, so again a different environment.
You find yourself in the very same emergency case, and you can access your landline. Yes, you've probably missed a few upgrades over the years; regardless, you are able to call 911. However, in this case no one is picking up, no one is responding to your call, and consequently no help is coming.

Now you may ask, why am I asking you to imagine these scenarios? Well, wouldn't you think that regardless of the location you're at, or the phone used, when calling a well-known number that uniquely identifies an emergency resource, like 911, you would get the same response? Or in other words, do you trust, and if so why do you trust, the persistence of that number and the corresponding response?

So what does this have to do with scholarly communication and persistent identifiers? We can apply these scenarios to Digital Object Identifiers, DOIs, if we imagine phones being web clients, locations being network environments, and 911 calls representing requests against DOIs. So I can rephrase the question from the previous slide in this context and ask: regardless of the web client and network location, would you not expect the same response from the web server when requesting the same DOI?

This led us to our study, in which we conduct a quantitative investigation of scholarly publishers' responses when DOIs are requested with very common HTTP requests. We use different web clients and different request methods in our experiments; those methods resemble, on the one hand, machines browsing or, if you will, crawling the web, and on the other hand humans browsing the web. We also conduct our experiments from two different network environments: the Amazon cloud, an AWS instance, and the Los Alamos National Laboratory internal network. As you can imagine, these two networks have very different levels of subscription or license agreements with commercial publishers. Our intention here is to test the consistency of DOI responses, because after all, without a level of consistency, how can we trust the persistence of such identifiers?

All right, so how does this work? We all know what this is: this is a DOI. If I paste this DOI into my browser, my browser tells me it's negative 88,579. So this is not how it works. But we can make DOIs actionable on the web via HTTP, and if I paste this actionable DOI into my browser, I eventually get the resource that I expect. However, behind the scenes, somewhat opaque to the user, there is more to this DOI resolution process. If I dereference this DOI, there is actually an HTTP redirect happening to a different resource, in this case hosted by Springer, the publisher of this resource. But this is not the end of the redirect chain: Springer establishes another redirect to yet a different resource within Springer, because they figured out there's now HTTPS and they should probably serve their content by that means. And there is yet another resource that gets redirected to. So you see there is an entire redirect chain involved in dereferencing this single DOI, two links of which, in this case, are on the end of the publisher, on the end of Springer. So: an entire redirect chain per DOI.
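To make the redirect-chain idea concrete, here is a minimal sketch, assuming Python's requests library, of how one could walk the redirect chain of an actionable DOI hop by hop and print each response code. This is not the tooling from our study (we used curl and Chrome, as described next), and the example DOI is a made-up placeholder.

```python
import requests

def follow_redirect_chain(doi, max_hops=10):
    """Follow the HTTP redirect chain of an actionable DOI and
    return (url, status_code) pairs for every link in the chain."""
    url = f"https://doi.org/{doi}"
    chain = []
    for _ in range(max_hops):
        # allow_redirects=False lets us observe each hop individually
        resp = requests.get(url, allow_redirects=False, timeout=30)
        chain.append((url, resp.status_code))
        location = resp.headers.get("Location")
        if location is None:          # no further redirect: end of the chain
            break
        url = requests.compat.urljoin(url, location)
    return chain

if __name__ == "__main__":
    # hypothetical example DOI, not one from the study's corpus
    for hop_url, code in follow_redirect_chain("10.1000/xyz123"):
        print(code, hop_url)
```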
So then we asked ourselves: how do we assemble a corpus of DOIs that is somewhat representative of the scholarly communication landscape? You will probably believe me when I say this is not trivial, and in the interest of time I won't go into the details of how we did it. For now it suffices to say that we assembled a corpus of 10,000 DOIs that we believe is somewhat representative of the overall scholarly communication landscape, and I refer you to the preprint mentioned earlier for details of how we generated this corpus.

So we have 10,000 DOIs that we can dereference. With what methods and clients do we dereference them? We have four methods.

The first one is an HTTP HEAD request. If you don't know what that is, it's basically a lightweight ping against the web server: a request to which the web server responds with response headers only; the server does not return a response body. We utilize the popular command-line tool curl to send this sort of request, and I visualize here an example of how a curl HEAD request against a DOI looks. In this case the redirect chain has only two links: the first link redirects with the response code 302, highlighted here, to another resource that eventually returns the response code 200, meaning OK, and that is the end of the redirect chain. This sort of request closely resembles a machine browsing or, if you will, crawling the web; I would argue that only very geeky humans would ever send such requests on a regular basis.

For the second request method we again use the command-line tool curl, but we send a GET request. What's different here is that the server responds with a response body as well as the response headers, and I visualize an example here in a very simplified manner, indicating the response body being returned as well. This also is a request that very closely resembles a machine crawling the web; there are not a lot of humans who send these requests on a regular basis.

The third request method is an enhancement of the second: it's again a GET request, again with curl, but we enhance the request with a number of optional parameters, for example conveying who we are via a user-agent string, accepting cookies, and so forth. That makes it a bit of a hybrid between the two, but it still more closely resembles a machine crawling the web than it does a human.

Last but not least, our fourth method uses the popular web browser Chrome, which we can use at scale by remote-controlling it with a Selenium WebDriver. Since this is a real, full-fledged browser, this way of dereferencing DOIs very closely resembles humans browsing the web. A rough sketch of how these four methods could be issued programmatically follows below.

As a bit of a side comment: the document that standardizes the ins and outs of HTTP, RFC 7231, states that a server SHOULD, so a very strong suggestion, send the same response headers regardless of whether the request method is a HEAD or a GET. So regardless of which of these we send, we should expect the same response headers. And as you've seen in the previous examples, part of those response headers is the response code: 200, 300, and so forth.
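Before we look at the results, here is the promised sketch of how the four request methods could be reproduced programmatically. Our study issued the first three with curl and the fourth by remote-controlling Chrome via Selenium, so the Python requests and selenium calls below are stand-ins under that assumption; the DOI, the user-agent string, and the header values are illustrative, not the exact parameters we used.

```python
import requests
from selenium import webdriver  # requires Chrome and a chromedriver install

doi_url = "https://doi.org/10.1000/xyz123"   # hypothetical example DOI

# Method 1: HEAD request -- response headers only, no body (curl HEAD in the study)
r1 = requests.head(doi_url, allow_redirects=True, timeout=30)

# Method 2: plain GET request -- response headers plus body (plain curl GET)
r2 = requests.get(doi_url, allow_redirects=True, timeout=30)

# Method 3: "enhanced" GET -- same as above, but with optional parameters such as
# a user-agent string and cookie handling, so the client looks less bot-like
session = requests.Session()          # a Session keeps cookies across requests
r3 = session.get(
    doi_url,
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",   # illustrative value
        "Accept": "text/html,application/xhtml+xml",
    },
    allow_redirects=True,
    timeout=30,
)

# Method 4: a real, full-fledged browser (Chrome), remote-controlled by Selenium
driver = webdriver.Chrome()
driver.get(doi_url)
final_url = driver.current_url   # Selenium exposes the final URL, not the status code
driver.quit()

print(r1.status_code, r2.status_code, r3.status_code, final_url)
```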
Just as a reminder for everyone, there are several levels of response codes that we can observe: 200-level response codes indicate success of the HTTP transaction; 300-level codes indicate some sort of redirection, meaning there is another link to be followed; 400-level codes indicate an error on the client side; and 500-level codes indicate an error on the server side.

So let's start and look at some results of our experiments. This visualization shows the results of our four methods dereferencing all 10,000 DOIs. We follow the entire redirect chain per DOI, record the individual links of that chain, and visualize here the response codes of the final links of those redirect chains. Note that this is the experiment as conducted from an Amazon cloud instance, our EC2 AWS instance, so the external network, if you will. The four methods are shown on the x-axis and the 10,000 DOIs on the y-axis, and the colors convey the HTTP response code level: green indicates a 200-level response, gray a 300-level response, so a redirect, red a 400, and blue a 500.

There are several observations we can immediately make from a visualization like this. The first is that less than 50% of all our DOIs return a 200-level, so success, response across all four request methods. That in and of itself is a very sad statement, because, put the other way around, more than 5,000 of our DOIs do not respond consistently with a success code to all four methods, and that is worrisome.

The second observation is that a significant portion of DOIs do not respond well when the simple GET method is used. 300-level responses should not be the final response code of a redirect chain, because by definition a redirect indicates that there is something else to follow, yet more than 40% of these DOIs end with that level of response code for the simple GET method. There is no obvious explanation for that, especially given the next observation: a significant portion of those same DOIs return a 200-level response code when the HEAD or the Chrome method is used. That doesn't make any sense to us, and we don't have an obvious explanation for this sort of finding.

The last observation I want to make on this graph is that we have a good portion of DOIs returning a 400-level response when the HEAD method is used. If you're familiar with HTTP, you may say, well, this could be a whole bunch of 403 response codes, indicating that we don't have the rights to access these resources, that access is actually forbidden. You may also say this could be a whole bunch of 405s, indicating that the HEAD method is not allowed on the resource. Neither is the case: we double-checked, and this portion really is dominated by the response code 404, and I'm sure everyone is familiar with that one, meaning the resource is not there. Which is interesting, especially given that a chunk of those DOIs return a 200-level response when other request methods are used. So what is going on here? Do these resources exist? Do they not exist? Are these Schrödinger's resources? It is odd, at the very least.
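To illustrate how such per-method comparisons can be tabulated, here is a small sketch, under the assumption that the final response code of each redirect chain has already been collected per DOI and per method; the data layout and names are my own, not the study's.

```python
from collections import Counter, defaultdict

def code_level(status):
    """Collapse an HTTP status code into its level: 2xx, 3xx, 4xx, or 5xx."""
    return f"{status // 100}xx"

def summarize(results):
    """results maps DOI -> {method: final status code of its redirect chain},
    e.g. {"10.1000/xyz123": {"head": 404, "get": 302, "chrome": 200}}."""
    per_method = defaultdict(Counter)   # method -> Counter of code levels
    inconsistent = []                   # DOIs whose code level differs by method
    for doi, by_method in results.items():
        levels = set()
        for method, status in by_method.items():
            per_method[method][code_level(status)] += 1
            levels.add(code_level(status))
        if len(levels) > 1:             # e.g. 404 on HEAD but 200 with Chrome
            inconsistent.append(doi)
    return per_method, inconsistent
```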
However, all the observations I just went through, including the changing response codes when different request methods are used, strongly indicate that scholarly publishers indeed respond very differently to requests against the same DOI depending on which method is used. There are differences between responses when the request method resembles machine behavior on the web versus human behavior on the web, because you can clearly see, comparing all four methods, that the Chrome method, the one that most closely resembles human browsing behavior, clearly performs best in terms of the number of 200-level responses.

If I now compare this set of results from the external network with the same experiment repeated on the internal network, we can put the graphs side by side: on the right-hand side are the results from the very same experiment, this time conducted within the LANL internal network. A couple of things are immediately striking. First of all, the ratio of DOIs that consistently return a successful response has increased to almost two thirds: roughly 66% of the DOIs return 200-level responses across the board, a significantly higher percentage than from the external network. The Chrome method, the browser-based, human-resembling method, performs even better in this case. However, we still see a lot of oddities, let's say: a lot of 400-level responses, actually more 500-level responses, and a similar fraction of magic resources that change response codes when different request methods are used. I believe there is still a good chunk of DOIs that display behavior very similar to what we saw in the external experiment.

So now we have a comparison between the internal and the external network. Next we asked ourselves: could there be a difference between resources that are open access and resources that are not? We took our external experiment, shown here on the left-hand side, and ran all these DOIs against the Unpaywall API in order to identify the DOIs that identify open access resources versus those that don't. The fraction of DOIs for open access resources is depicted in the center, non open access on the far right, and for comparison the overall results for the external network are on the left. I'd like to stress, however, that the fraction of DOIs that identify open access resources is fairly small, just under 1,000 DOIs for this category, as you can see in the center graph. It's a very small dataset, and I would propose to repeat this set of experiments with a larger dataset in order to increase our confidence in these findings. However, I do believe we can see a couple of patterns potentially emerging. First, the fraction of DOIs that return 200-level responses across the board is higher for open access content, slightly, but still higher. Second, there is still this oddity of simple GET requests, and actually enhanced GET requests as well, getting stuck, it seems, with 300-level responses for open access content.
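As an aside on the mechanics of that split, here is a minimal sketch of an open-access lookup against the Unpaywall REST API, assuming its documented v2 endpoint; the email address is a placeholder you would replace with your own, and DOIs unknown to Unpaywall are simply treated as non-OA here.

```python
import requests

UNPAYWALL = "https://api.unpaywall.org/v2/{doi}?email={email}"

def is_open_access(doi, email="you@example.org"):
    """Ask Unpaywall whether the work identified by this DOI is open access."""
    resp = requests.get(UNPAYWALL.format(doi=doi, email=email), timeout=30)
    if resp.status_code != 200:
        return False                  # DOI unknown to Unpaywall: count as non-OA
    return bool(resp.json().get("is_oa"))

def split_by_oa(dois, email="you@example.org"):
    """Split a corpus of DOIs into open-access and non-open-access subsets."""
    oa, non_oa = [], []
    for doi in dois:
        (oa if is_open_access(doi, email) else non_oa).append(doi)
    return oa, non_oa
```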
Returning to the graphs, we can see, again with the caveat that there are many more DOIs in this fraction, that a lot of the non-successful responses, the 400- and 500-level responses, stem from the non open access content shown in the graph on the far right. And a lot of the oddities, the DOIs that return a 400-level response for a HEAD request and a 200-level or other response for other requests, also seem to stem from that category of DOIs.

The last set of experiments I'd like to get into tries to address the question: how does this picture change if we look at content that we at Los Alamos National Laboratory have subscription rights or license agreements for? Here we identified DOIs that identify content for which we have subscription rights or license agreements, so content we can access, versus DOIs that identify content we do not have the right to access, depicted on the far right again. A similar caveat applies here as well: the number of DOIs is not evenly distributed, let's say, and it's a very small dataset, with only 1,266 DOIs identifying content that we have subscription rights for. So again, I would like to generate a larger corpus to increase our confidence. But the first impression we get from the numbers is kind of stunning, in that a significantly larger portion returns a 200-level response across the board for content that we have subscription rights to. The question still remains, though: what happens to the other 16%? In theory this should be a graph that is entirely green. As we can see, it is almost entirely green for the Chrome method, but the other three methods leave some problems to be analyzed and explained, let's say. And as in the previous graph, a good chunk of the non-successful responses seem to stem from content that we do not have access rights to.

So what does this mean, how can we sum this topic up, and how can we try to move forward? I think we have shown solid indicators that scholarly publishers indeed frequently respond inconsistently to different requests against the same DOI. The responses seem to depend on the HTTP client used, on the HTTP request method used, and also on the network environment from which the requests are made. What are the implications for the persistence, or maybe the perceived persistence, of our DOIs? Well, I would argue that this level of inconsistency in DOI resolution surely does not build trust in our DOIs and our DOI infrastructure, and remember that the end of the redirect chain is on the publisher side. I'd also argue that the lack of adherence to standards does not build trust. Here I'm referring to the changing of response codes from one method to another: one method tells you the resource is a 404, but another method says, here it is, 200. That does not build trust in our infrastructure. As indicated, obviously more work is needed in this realm, but the initial findings indicate the following.
DOIs that identify open access content show a little more consistency than those that identify non open access content, and the same seems to hold true for DOIs that identify content our institution has a license agreement for. I would like to repeat this sort of experiment in different institutional setups to see whether that is a reproducible pattern or whether, for some reason, we are an outlier.

On the more technical level, what are the implications of these findings, especially for, let's say, the archival efforts of libraries and archives? My recommendation would be for developers to test as many combinations of request methods, clients, and network environments as possible, because we've seen that the responses are not consistent, and we need to find the way that works best for our use cases. However, it seems to be generally true that if you can make your crawling environment appear as human as possible, the odds of succeeding, in terms of receiving the largest number of 200-level responses, seem to be best.

With this, I'd like to stop. Thank you very much for your attention and for being here. I would love your feedback on these findings, I would love to see people reproducing these sorts of experiments in other network environments, and I'd be happy to have a discussion about any and all of what we talked about. And with that, I believe I hand it over to Diane.

Wow, thanks Martin. That was very interesting, and disconcerting to say the least. I see that we already have a question for you, from Rob Sanderson. Rob says: hi Martin, fascinating research. Did you consider the intersection of non-subscribed, non-OA 400 or 500 responses to see what proportion are permission based, perhaps incorrectly reporting 404 rather than other codes? These could be considered to successfully resolve to an error due to the paywall.

Yeah, that's a good question, thanks for that. The short answer is no, we have not. We've not done any sort of intersection or cross-section analysis of these levels of access rights. We purposefully kept it simple for now and kept the analysis to OA versus non-OA for the external experiment and subscription versus non-subscription rights for the internal experiment, but I absolutely agree this would be a very relevant and interesting experiment. I would also like to increase the dataset to build our confidence there; I'm honestly not sure whether the roughly 10% of DOIs that are open access is representative of OA versus non-OA content overall. But recently Crossref has made available an entire dump of their DOIs, so now we have a corpus that can be argued to be somewhat representative. That would be another way of getting more DOIs and experimenting with those.

Okay, thanks Martin, and thanks Rob for that question. We have another question that's come in from Robin Ruggaber, who asks: did you investigate the redirects to see if there was something valid there? We have a situation with legacy archival content where, if we were to assign persistent IDs, we would need to crosswalk between the DOI and the repository ID. I'm not sure I'm following your use case there, Robin, but I believe part of the answer is no, we have not done a closer investigation of the individual links of those redirect chains.
For now we basically just looked at the individual response codes and focused in particular on the response code of the last link of each redirect chain. It's on our plate for future work to look more closely at how these redirect chains are made up, how they may evolve over time, and what the differences are, basically what the path to the resource is depending on the client used and the environment you're in. So a first investigation of how these redirect chains are built is up for future work. Another aspect of that is the notion of what these resources actually are: did we really land on, let's say, the landing page that we expect to land on, or are these what's known as soft 404s, where the server says 200 but on the page a human can read "sorry, we don't have that"? Those sorts of scenarios are still on our plate, something we have not looked into, but a very interesting part of that experiment for sure.

All right, Robin says she will email you about the use case, but thanks for going into the redirects further, Martin. That's good, yeah, thank you.

Looks like we have another question, from Stefano: what is the reasoning behind pretending to be human for a higher success rate when testing? I would think that consistency of responses is more important than success rate, but maybe you're looking at different end goals.

Yeah, that's a good question, Stefano, thank you. I would say it really depends on your use case. If you're in the business of maintaining an archival crawler, like, let's say, LOCKSS for example, I would probably make the case that you would try to optimize your environment for success. If you're trying to build a quantitative or comparative study similar to what we did, you probably want to highlight all the oddities and all the cases where things went wrong and then try to analyze how they went wrong. So our perspective here was that we're trying to put numbers on those observed oddities in order to inform people that it is not the case that you can just send a simple request and expect the same sort of answer as with a more complex HTTP request. Everyone who has done some sort of crawling in the past will have observed that. I was fed up with it, honestly, which is what motivated us to put numbers on it and publish them to really outline that this is not normal. This is not an environment where we can trust having found the panacea for persistent identifiers; if we're treating our infrastructure the way we do, this does not build trust, and that was really the whole point. But your distinction is very well taken, I absolutely agree.

Interesting. All right, thanks, Martin. We can probably take one more question if anybody would like to ask one before we close the webinar. While we're waiting to see if there are other questions, I just want to remind everyone that you have joined a webinar that is part of CNI's virtual Spring 2020 meeting, and we're really happy you were able to make time to come and hear Martin's talk today. I'm pasting into the chat box the direct URL to our schedule for the rest of the meeting. There's plenty more to come: we'll be having live streaming webinars throughout the rest of April and throughout the month of May as well.
Our next live webinar will be next week, on Tuesday, April 21. We'll have a talk on revising the VIVO ontology, and a couple more that same day: our first session from our recent call for proposals having to do with the COVID-19 crisis, "Event 201: Why Weren't We Paying Attention?", and also a session on SimplyE, all on that same Tuesday, April 21. So please join us; there's lots of great content to come.

It looks like we don't have any more questions. I'm seeing folks chat in "thanks so much, Martin, that was really enlightening," and I think I can hear the applause from a distance. Thanks to everyone who joined us, and Martin, thank you so much for coming to CNI and sharing the results of your work with us today. I'm sure we'll look forward to updates on the research yet to come.

Thank you so much for making this happen. This was very fun. Thank you for being here, everyone, and thanks for listening in.

Thank you so much. Have a great afternoon, wherever you are in your time zone, and we hope to see you back again soon. Be well.