Good morning. I'm Sharon Farb, and we are so excited to be at CNI and present the preliminary results from our research study. I'd like to introduce the research team: Peter Broadwell, Martin Klein, Todd Grappone, and myself, all from the UCLA Library. Let me tell you just a little bit about the format of our presentation this morning. I will give a few introductory remarks to situate the context of the research; we really want to spend most of the time exploring the research findings so far with you. Then Pete will describe the methodology of the research, Martin will give some highlights of the preliminary findings, and Todd will have some concluding remarks. We've tried to save time for questions and discussion, including how we might collaborate with interested CNI members to do some further research on aspects of this together.

So I'd just like to highlight a couple of trends that provide context and situate the research. As everyone knows, there are rising costs. There is a significantly growing research corpus, particularly with respect to the sciences, and there's an increase as well in research funding. Let me pause for a minute and highlight what's on the slide about the costs and revenue with respect to STM, and in particular journals, because journals are the focus of our study: journals are about 40% of the STM market, and of that, 68 to 75% comes directly from library subscriptions. Another trend that has become particularly important in the last few years is the significant increase in open access journals in the sciences, and we'll talk some more about that in a minute. The other one I want to highlight isn't new; it has in fact been a constant in scholarly publishing: the critical contribution of our faculty and researchers to the production of the scholarly literature. Elsevier conducted a study using University of California data, and this chart comes from the publications chapter of the University of California impact report. The Elsevier study found that one out of every 12 research publications published in the US was created by University of California faculty. It would be fascinating to do a study exploring the contribution of CNI members' faculty and researchers to the production of scholarly literature; one might imagine the results of that study showing something like 10 out of every 12 publications.

Everyone is familiar with this chart, which shows the correlation, and in some ways the gap, between the steady rise in journal and book prices and the consumer price index. What people may not be as aware of is the study conducted in 2013 by the National Association of State Budget Officers, which found that just under half the states in the United States, 24 states, were operating with general fund expenditures at the 2008 level. That means that those of us at universities tied to public and state funding may not even be at flat budgets, but at budgets still skewed to 2008 levels. This next chart comes from a study published in PLoS ONE entitled "Open Access to the Scientific Journal Literature: Situation 2009."
In 2009, the authors found that approximately 25% of the research publications in the sciences were at that point open access; in the chart you can see the green and the gold, but collectively it was around that much. In 2015 there was a follow-up study, whose title I don't have handy, which found that 61.1% of the journal literature was by that point freely available online. Our study looks specifically at the contents of arXiv, the physics- and math-focused repository hosted by Cornell. Its operating budget, including everything, was just over $800,000 for the period from 2013 to 2017. In contrast, we can't tell you specifically what the corresponding cost of those final published versions is, but we do know that the STM journals market is roughly $10 billion, so somewhere between 1% and maybe 15% of that is what the number likely is. In any case, we can say that it is likely 10 to 15-plus times the price of arXiv. The STM Report 2015, authored by Mark Ware and Michael Mabe, describes the value-added roles of publishers. The one we focused on for this study, which is a content analysis, is copy editing. We have two working assumptions to briefly highlight: first, that if the publishers' claim with respect to copy editing is valid, then the text of the preprint should differ in some form from the corresponding postprint or final published version; and second, that by applying measures of similarity, we should be able to detect and quantify such differences. So with that, I'm going to turn it over to Peter, who's going to describe the research methodology and data collection.

Thanks, Sharon. So I'll just jump right into our first methodology slide. As you can see, we assembled two different corpora: if you're going to compare preprints and postprints, you need a good number of both. When we were collecting our preprint corpus, as Sharon mentioned, we focused on arXiv.org, so most of the papers we collected have to do primarily with physics and math. We ended up downloading pretty much everything, at least all of the PDF papers that are in arXiv, along with the metadata associated with them. The metadata is available through an OAI-PMH interface. The PDFs you can download through Amazon's S3 service; arXiv has a requester-pays setup there, so we paid to download them. It cost us about $60 to download all 500 gigabytes of arXiv papers. This is totally above board, it's all on their website, they say you can do it, and we paid for it, so we didn't cut into their $800,000 budget at all. arXiv also keeps versions of preprints; we used the latest available version if there were multiple, and as Martin will explain later, using the latest version was okay: we were able to ascertain that we were still working with preprints, not preprints that had effectively become postprints. To gather the matching postprint corpus, we extracted the DOIs from the metadata available through arXiv. arXiv doesn't actually assign DOIs, but authors can go in and add them later, and about 44.5% of the articles in arXiv have these DOIs. We then used those to look up the same articles in their final or postprint versions through the Crossref API. For some articles, Crossref allows you to download the full text in XML along with the metadata for the articles.
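A minimal sketch of what this kind of harvesting could look like, assuming arXiv's public OAI-PMH endpoint and Crossref's public REST API; this is illustrative only, not the team's actual pipeline, and it omits resumption-token paging, rate limiting, and the requester-pays S3 download of the PDFs:

```python
"""Illustrative sketch only -- not the study's harvesting code.
Assumes arXiv's public OAI-PMH endpoint and its 'arXiv' metadata format
(which exposes the author-supplied DOI), plus Crossref's REST API."""
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

def harvest_arxiv_dois(set_spec="physics:hep-ph"):
    """Yield (arXiv id, DOI) pairs from the first OAI-PMH response page;
    a full harvest would follow resumptionTokens until exhausted."""
    resp = requests.get("http://export.arxiv.org/oai2", params={
        "verb": "ListRecords",
        "metadataPrefix": "arXiv",   # arXiv-specific format, includes <doi>
        "set": set_spec,             # example set; the study covered all of arXiv
    })
    root = ET.fromstring(resp.content)
    for record in root.iter(OAI + "record"):
        meta = record.find(f"{OAI}metadata/{ARXIV}arXiv")
        if meta is None:
            continue
        doi = meta.findtext(ARXIV + "doi")   # present for roughly 44.5% of records
        if doi:
            yield meta.findtext(ARXIV + "id"), doi

def crossref_work(doi):
    """Look up the postprint's Crossref record (full-text links appear
    under 'link' when the publisher deposits them)."""
    r = requests.get(f"https://api.crossref.org/works/{doi}")
    r.raise_for_status()
    return r.json()["message"]

for arxiv_id, doi in harvest_arxiv_dois():
    work = crossref_work(doi)
    print(arxiv_id, doi, work.get("container-title"))
    break   # demo: just the first matched preprint/postprint pair
```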
That's assuming your institution has subscriptions to those journals; we were able to take advantage of UCLA's wide range of serial subscriptions to access a large number of papers that way for the comparison. Once we had both sets of papers, we did some processing to get them ready for the actual text comparisons. We converted the PDFs to XML when we could, using fairly standard open source tools for PDF-to-XML conversion. We suspect they might be the same ones publishers actually use when submitting their papers to Crossref, but we don't know that for sure. We then extracted the various sections of the articles from the XML. The XML does a pretty good job of delineating those sections for us; occasionally some are missing, but we found this amounted to fairly insignificant statistical noise in our overall analysis. When you're automatically processing a large quantity of text, there's always going to be a little bit of slop. We also extracted metadata and some contents of the papers from arXiv's OAI-PMH interface, which gives you access to the metadata, including the titles and abstracts. One issue with this is that arXiv, because it's primarily a physics and math archive, allows authors to upload these sections with the LaTeX markup still in them, and that causes some problems for comparison. So for that reason we actually favored the versions of the titles and abstracts we could get from the PDFs instead.

Then we applied some basic text comparison algorithms to these sections of the papers, comparing them directly to each other. There are six listed here; I'm only going to talk about the top three. The results from the bottom three are actually quite similar to the results from the algorithms in red, but we do have a web interface, whose URL we'll provide later, where you can browse the results from all of the comparisons. These are fairly straightforward text comparison algorithms, so I'll outline them very quickly and try to make it as painless as possible. The first comparison is just computing the length ratio between, say, the preprint abstract and the postprint abstract, which, as it says here, is the ratio of the length of the shorter text to the length of the longer text. We have an example here. You get a nice number, and if you get a number for a whole bunch of abstracts, you can start to generate some good results graphs from those. As with all of these similarity metrics, the closer you are to one, the more similar the texts you're comparing. Our second method was the fairly well-known Levenshtein edit distance algorithm, which is the number of edit operations necessary to turn one text into the other. We have an example here showing that it takes three edits, either insertions, deletions, or substitutions, to transform the left string into the right string. This algorithm is often used in things like the spell-checking functions of word processors to say, "Oh, did you mean this word?", particularly when a word doesn't show up in the dictionary.
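As a rough illustration of those first two measures, here is a small Python sketch; it is not the study's code, and the classic kitten/sitting pair simply stands in for the example on the slide:

```python
def length_ratio(a: str, b: str) -> float:
    """Length of the shorter text divided by the length of the longer one
    (1.0 means the two texts are exactly the same length)."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(length_ratio("kitten", "sitting"))   # 6/7 = 0.857...
print(levenshtein("kitten", "sitting"))    # 3 edits
```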
To compare the papers we actually used the Levenshtein ratio, which is a slightly more complicated formula: you add together the lengths of the two texts and subtract the edit distance in the numerator, but not in the denominator. You can see intuitively that if the edit distance is zero, the numerator and denominator are the same, so the similarity is one, meaning very similar; the more the edit distance increases, the more your similarity score goes down. Levenshtein is good at quantifying editorial changes; it catches anything, including changes in capitalization, punctuation, that kind of thing, which is useful because this is one of the areas in which publishers can contribute value to the postprints. Our third method was a contrast to Levenshtein edit distance: cosine similarity, which is slightly more complicated to explain in full. But all we really need to know is that it tends to ignore superficial editorial changes and instead focuses on significant words within the text; it is more sensitive to changes to words that are actually characteristic of a given text. As it says here, common stop words are ignored, and words that are more characteristic of a particular text within the whole corpus get greater weight. I have an example here; these are actually the preprint and postprint titles of a particular paper in our data set. You can see that as we run cosine similarity, it does some normalization, gets rid of the capitalization, and ignores the stop words. I did some syntax highlighting to show intuitively which words are more characteristic of these texts and will get more weight. The word "light" is not all that characteristic, but it's also not a completely unimportant word, and the addition of that word is what makes the cosine similarity not quite one and drops it down a little bit.

Just to give you a visual overview of the sections that we compared, we have a paper here written by some people in this room, and I'll really briefly show you the sections we were able to parse out of the preprint and postprint. There's the title and the abstract, and after that we grabbed all the text in the body of the article, not including the title, the abstract, or, at the end here, the references. There are also the authors; both the authors and the references are fairly difficult to parse, so the comparison of those sections between preprint and postprint we've mostly left for future work. But breaking out these sections lets us examine more closely the character of the differences between preprint and postprint, relative to just looking at the raw text alone. And now Martin will talk about what we actually found when we did this.

Thank you, Pete. I need to apologize for my voice; I'm getting over a cold, so if I sound like Joe Cocker, that's not on purpose. So, some preliminary findings. As mentioned before, this is a study in its very early stages, so while we're confident that what we are going to share with you today is true, there's still much more work to be done. Maybe a couple more slides first with insights into the corpora that we have generated.
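Here is a matching sketch of the Levenshtein ratio and the cosine comparison Pete described above. The talk doesn't say whether raw term counts or tf-idf weights were used for the cosine measure, so scikit-learn's TfidfVectorizer with English stop words is an assumption here, and the example strings are made up rather than taken from the slide:

```python
"""Illustrative only; the exact weighting scheme and tooling are assumptions.
Requires the third-party 'Levenshtein' and 'scikit-learn' packages."""
from Levenshtein import distance as levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def levenshtein_ratio(a: str, b: str) -> float:
    """(len(a) + len(b) - edit_distance) / (len(a) + len(b)):
    1.0 for identical strings, lower as more edits are needed."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - levenshtein(a, b)) / total

def cosine(a: str, b: str) -> float:
    """Cosine similarity over lowercased, stop-word-filtered term vectors:
    insensitive to punctuation and capitalization, sensitive to changed words."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform([a, b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

pre = "a study of heavy ion collisions"             # made-up title pair, not the
post = "A Study of Light and Heavy Ion Collisions"  # one shown on the slide
print(levenshtein_ratio(pre, post))   # below 1: capitalization and added characters count
print(cosine(pre, post))              # closer to 1: only the added content word 'light' matters
```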
Pete mentioned that we downloaded all of arXiv's content; this is roughly 1.1 million articles. Half of that, give or take, had DOIs that we could use to match to their postprint versions, and we came up with a total of about 11,000 matched articles. The ratio is pretty bad, but we felt it's still a good enough sized corpus to go ahead with our study. Starting with 1.1 million and ending up with around 10,000 is not all that great, but it's still good enough for a preliminary study. Most of the papers that we matched were published somewhere between 2003 and 2015, which to a degree makes sense if you consider, first, when publishers started assigning DOIs to their scholarly work and, second, when the use of DOIs really became more ubiquitous: authors became aware of DOIs and maybe went back to arXiv and put their DOIs in their metadata. So that date range is not particularly surprising. What was a little bit surprising, however, is that the vast majority of the postprint papers that we found were published by Elsevier. Granted, this is an STM corpus, and the probable reason for this high percentage is that Elsevier was, as far as I know, the first partner to provide the full text of articles through the Crossref API. So the vast majority of postprint articles that the Crossref API will provide is actually Elsevier content. These are a few dimensions along which our dataset is specific, let's say; I don't want to say biased, but specific. The journal with the most papers represented in our corpus was Physics Letters B; I had no clue that existed, but apparently it does, so if you know what that is, good.

All right, another histogram, showing you the distribution of categories in our arXiv corpus. arXiv gives you nice categories that papers are submitted to. Everything in red is physics-specific; everything in blue is not physics. So again, there's a certain specificity to our corpus: the vast majority of papers are from high energy physics, there's a little bit of math in the second column, and the other non-physics categories are computer science, quantitative biology, statistics, and quantitative finance, I believe, but that's the long tail there. Again, a corpus focused on STM, in particular physics articles.

All right, results; finally we get to the results. The next three slides that I'm going to show you have graphs that look very, very similar. As Pete mentioned, our similarity measures are normalized, so you get values between zero and one, where zero, on the right-hand side, indicates a very low similarity and a value of one indicates a very high similarity. So left is high, right is low. The height of the bars represents the total number of papers, referenced to the left y-axis, and the red dot on each bar represents the relative percentage of papers, referenced to the right y-axis. Makes sense? Good. To make the graph a little bit more readable, we binned those values into categories: values that fall between 0 and 0.1 are in the first bin, 0.1 to 0.2 in the second bin, and so on, up to the last bin on the far left for values that fall between 0.9 and 1. What this graph shows you is the similarity of the titles. Pete mentioned that we extracted different sections from those articles; this is the title only.
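Since the next few slides all rely on that same binning of normalized scores into ten buckets, here is a tiny illustrative sketch of that step with made-up values; the actual histogram and plotting code is not shown in the talk:

```python
from collections import Counter

def bin_scores(scores, n_bins=10):
    """Count how many similarity values fall in [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]."""
    counts = Counter()
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)   # exact 1.0 goes in the top bin
        counts[idx] += 1
    return [counts[i] for i in range(n_bins)]

scores = [0.98, 0.95, 1.0, 0.15, 0.99]   # made-up similarity values
print(bin_scores(scores))                # [0, 1, 0, 0, 0, 0, 0, 0, 0, 4]
```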
So for example, for the length ratio of the titles, roughly 10,000 articles had a length ratio very, very close to one, somewhere between 0.9 and 1, meaning the length is almost identical for roughly 90% of all titles. Well, that was not the plan; the Wi-Fi kicked in and did something, sorry about that. Where's my cursor? All right, here we go. So the length of the titles is very, very similar. Now you can of course say that the terms "cat" and "dog" have the same length and yet are very dissimilar; granted, and hence we do the other metrics. Levenshtein and cosine give you a notion of how significant the changes were, and since those values are again very high for the top bin, the 0.9-to-1 bin, that indicates there are no significant changes in the titles there either. For example, Levenshtein comes in at roughly 70% in the top bin, with the remaining 20-something percent distributed over the second bin from the top, somewhere between 0.8 and 0.9, and, curiously enough, roughly 10% in the bin of values between 0.1 and 0.2, which demands further investigation as to why that is.

All right, so that was the title comparison; now we know that titles are fairly similar, which may or may not be surprising. So we looked at the abstracts as well. Again, as Pete mentioned, we extracted the abstracts too. Why would you do that to me, Wi-Fi? All right, here we go. Here's the abstract comparison, and the picture, as you'll see, is fairly similar. The length ratio and the Levenshtein values are dominated by the top bin: the vast majority of those values, more than 80%, fall into the bin that holds values between 0.9 and 1. So again, strong indicators that the abstracts are very, very similar. You'll also note that the cosine similarity comes in at about 60% in the top bin, so that's a bit lower. The only reasonable explanation we have for that is that there are not a lot of character-wise changes in those abstracts, but there may be entire terms that change, which slightly shifts the semantics of the abstract; hence the cosine is a little lower than the Levenshtein score. But you'll also notice that the remaining 40% of the cosine values come in in the second and third bins from the top, so it's still very left-heavy, even though it's not as dominated by the top bin as length and Levenshtein are.

Then we did the comparison of the entire body: not the title, not the authors, not the abstract, not the references, basically everything in between, and came up with these results, which are very interesting as well. Maybe you know that cosine similarity works better on longer texts than on shorter texts, because of the whole notion of contextually salient terms that it picks up on. And over 80% of our articles come in with a cosine similarity somewhere between 0.9 and 1. That should drive it home: the body of those articles is, contextually speaking, very, very similar. You'll also notice that the Levenshtein score is higher for the second bin from the top, somewhere between 0.8 and 0.9. I'll show you examples of that in a second; one example was that we found a lot of preprint papers that have a different way of referencing other papers.
So in the preprint you would see "as has been shown in related work," followed by reference one, reference two, reference three, reference four, and in the corresponding postprint version of that article it would be "as has been shown in related work," reference one dash four, so one through four. Levenshtein picks up on that because, character-wise, it's a substitution; cosine doesn't pick up on it because it's the same thing. So that's the point. Maybe one example really quickly: this is the abstract of a paper published in Physics Letters B that highlights the differences that our tool picked up on. This might be hard to read, so let me really quickly point out what Levenshtein in particular picked up on. "Color Glass Condensate," whatever that is, is capitalized in the preprint version and lowercase in the postprint version; again, Levenshtein picks up on that because it's not the same, while cosine says it's the same thing. There is a capitalization of the term "Letter" in the postprint but not in the preprint. And interestingly enough, on the bottom right, if you can see it, the preprint version has the term "parametrization" and the postprint version has "parameterization"; they inserted an "e" because that seemed more correct, or whatever. But those are the sorts of differences that we did find and that our Levenshtein similarity metric picks up on.

This was a lot, and it went by very fast; I realize that, so we made it a bit more convenient for you and built a website that contains all of these histograms. If you fancy, you can go to solarglow.library.ucla.edu/prepost. I just tweeted the link as well, because that's the time of year we're in, right? On the website you can say, give me all the cosine similarity values for titles, and I also want the Levenshtein ratio for titles; so you get the Levenshtein ratio for titles and, if you scroll down, the cosine for titles. So this is for you, a takeaway message: if you need some homework, go there at home and check it out.

All right. Now, you often hear the argument: well, is this really a preprint? I've actually done it myself: I had a paper accepted at a commercial publisher and then uploaded it to arXiv for good measure. And the other way around works as well, of course; I've done that too. I have a tech report somewhere, a long version that no one wants to read, so I put it up on arXiv, and later on, and this was actually not too bad, I submitted a shorter version of it somewhere and it got accepted. So the notion of sequentiality, what comes first, is I think important to consider here. So we extracted the reported publication dates of those articles as well, and here's the distribution. Everything in red was published on arXiv.org first, binned again into date ranges. The black line is there as a distinction because the bin sizes are a bit misleading: everything right of the black line is a bin size of 100 days; everything left of it is a bin size of 10 days. So we have roughly, let's say, 1,000 papers that were published on arXiv.org first, between 40 and 50 days prior to their corresponding postprint versions. Right?
That at the very least gives you a notion that it's not necessarily true that arXiv only holds the open access version of your postprint article, because the sequentiality is, let's say, distributed. All right, one more slide. Pete hinted at the author similarity, which was really intriguing to us. However, it's not trivial to do. If everything had ORCIDs, this would be easy, but unfortunately we're not there yet. Extracting authors is not that hard; interpreting the names is hard, because the concept of a first name and a last name is not uniform across the world. Metadata is messy, as we all know. So we have examples where the first name and the middle name are labeled as the first and last names, and then the actual last name is labeled as the first name of the next author. All kinds of crazy things happening again; if you do experiments at scale, crazy crap will happen. However, with some heuristics we can make a safe statement that roughly 80% of the author lists are identical, meaning the number of authors is the same and the order of the authors is the same. That's the case for 78%; however, that leaves 22%, and what's up with those articles? That's an aspect for future work; we'll need to look further into it. But that's as far as we want to go with the state of the results we have to date. All right, that was it for the preliminary results. I'll hand over to Todd, who will cover a few aspects of future work and spark a discussion.

Thanks, guys. So this is obviously pretty preliminary work, and there's more to do. Martin and Pete hit on some aspects of the work that need to be refined. We'd also like to get to different kinds of content, and to overlay this work with other impact factors to see if we can spark some discussion and ask some questions about those other measurements as well. We'd obviously like to move into different disciplines; we feel like what we've done so far is pretty good, but there's a pretty good chance that once we get into other disciplines, the changes we can measure will move quite differently. We're also interested in collaborators. One of the things we're going to do in the future is put up a web API so that you can compare your own content, your own corpus, the way we've done it, and come up with your own results; I'm curious whether you'd be interested in that. We have our own ideas about next steps: we want to look at some other disciplines, and we want to do some qualitative work, since this is obviously all quantitative work. But we would be curious what the folks hearing about this might be interested in and how we might collaborate together. We did save time for that, so if people have any questions or would like to talk about areas you're particularly interested in, that would be really useful to us. Thanks so much for coming.