So, Atish. Yes? Let me just ask a quick question. Is it OK if I pose questions to the audience? I think that's excellent. You can follow the Socratic method. Yeah, I mean, there are just a few places, and I just don't know if the interactivity makes that possible. What do you think, Massimo? We can do this, right? Yeah, sure. Yeah, OK. Because a webinar is less interactive than a normal Zoom meeting. OK. So maybe we can get started now, OK? So good afternoon and good morning, everybody. It's a great pleasure for me to welcome Paul Ginsparg, and to welcome all of you to this colloquium by Paul Ginsparg about the arXiv: its past, present, and future. Paul has been a professor of physics and information science at Cornell University since 2001. In fact, he has worked on areas of physics which are pretty close to my own interests in quantum field theory, string theory, and conformal field theory. I remember that when I was a student there was a very popular review by him on applied conformal field theory. It's still popular. Still popular. Still very popular. And many of these big IT companies got started in some garage in Palo Alto, they say. This one, I know, got started in Aspen. And I can sort of say that I was present when it got started, because I was a postdoc in those days. There was this kind of informal network of exchanging emails which was taking shape and which was useful in the string theory community. And then Paul said, OK, this is not the best way of doing things. Why don't we just make it proper and automated? And I think in just one afternoon he more or less got the basic thing set up. And then it has really taken off in ways that perhaps he also might not have imagined. It has really become a major mode of scientific communication. And it has had a huge impact, I think, on the world of science, and especially for the developing countries.
And for ICTP, it is particularly important because, as you all know, it's really fundamental to ICTP's mission to make science available to everyone around the world, overcoming the barriers of geography, economics, gender, ethnicity, et cetera. And the old method of exchanging preprints, some people used to call it the white plague. It was, first of all, very restricted to a group of people, not available to everybody. And I think the arXiv has completely changed that. And I really look forward to what Paul has to say about what he thinks and what his experience has been. He, of course, is a recipient of many major awards. He's a fellow of the American Physical Society. In 2002, he was a MacArthur fellow. And he has been on the advisory boards of many of these major initiatives on open science, like CODATA and so on. In 2013, he was named a White House Champion of Change. And as you know, open science is also becoming an important theme in recent deliberations at UNESCO. ICTP, as you know, is governed by a Tripartite Agreement between Italy, UNESCO, and IAEA, so UNESCO is one of our stakeholders. And they are currently working on the development of an international standard-setting instrument on open science. It's like a normative document in the form of a UNESCO recommendation on open science, which will be adopted by the member states. So in that direction, in fact, Marco has been actively organizing a number of important seminars. Prior to this, we had a very nice seminar by Ana Persic from UNESCO. Then also one from Kamran Naim, who is the head of open science at CERN, with particular focus on SCOAP3. And it's in that series that Paul's colloquium was organized.
And since it would be of broad interest, we decided to hold it as a colloquium, because I'm sure it's of interest to many people in the Institute, and to make it broadly available also to people who might be listening in from elsewhere. So we are really looking forward to it. I should just add that the arXiv really has had a huge impact. I mean, I myself come from India, OK, even though I worked outside of India most of the time. But when I returned to India, the arXiv continued to be the source that you go to, instead of going to the journals, to get the latest that is happening in your field. And in fact, it has mattered for many people's careers in developing countries. For example, Marco just informed me that during the Venezuelan economic collapse many people were able to complete their graduation essentially depending completely on the arXiv. So it's a very important contribution to this general concept of making knowledge broadly available and somehow protecting it from the evil empires of profit making. With these words, I can also add that ICTP has another in-house project which is in line with the goals of the arXiv. We have this electronic journals delivery service, which we created at ICTP in 2002 for least developed countries. The idea is that we make journals available to scientists in the developing countries in email format. And perhaps there are now better ways of doing this. So I really look forward to hearing what Paul has to say, and hopefully we will have the opportunity to invite him to ICTP in person. What you're missing out on is very good coffee and some good Italian dinners afterwards, and direct interaction with some of the students and professors here. So hopefully we can continue this dialogue and think about what role ICTP, or organizations like ICTP and CERN, can play in the years to come. So over to you.
I should perhaps add that he is, or used to be at least, an avid hiker. And the youthful picture that you see in the announcement is from the days when he wrote Applied Conformal Field Theory, which was in the 1990s, I think. So Paul, to you.
Thank you, Atish. I would have been happy if you had kept going. That was good. For the record, I ground my own Italian coffee this morning for my cappuccino, so I'm doing OK. If it really was a single afternoon, I've been paying for it ever since. But let me go to slide sharing. We've done this already. Wow, there are a lot of people here. This is frightening. And I have to play. So is this good? OK, so let me just say as preamble: when Atish invited me to give this, I said that, you know, sorry, I don't do Zoom. Even all of last year during the pandemic, I was sufficiently unhappy with it that I agreed to continue teaching my classes in person. We had comprehensive surveillance testing here at Cornell, so it was safe. And just now I've realized, unexpectedly, the biggest debit of doing this via Zoom and not traveling. Of course, I would have loved the good Italian food there and all the rest. I haven't left this room in a year and a half. But the biggest debit is that I ordinarily prepare talks while I'm on the airplane to my destination. Having missed that, I had no opportunity really to think through what I wanted to say. And on top of that, I've been stuck, coincidentally, in arXiv external advisory board meetings. There was an annual meeting for the last two days. So you'll have to forgive me if I don't have everything as meticulously assembled as usual. But I'll try to give you a flavor of some of the early history that Atish mentioned, then discuss a few technical aspects, and then try to project where things are going. So I'm starting here with a screen grab, from on the order of 20 years ago, of what the interface looked like.
And this is a more recent one in my second slide. You'll see there isn't a huge amount of difference, and that's partly by design. When I used to give talks about this, I joked that there were two ways of looking at it. The glass half full point of view was, oh my, what remarkable foresight, that software written in the mid 1990s is still up on the web and running. And that's an impressive achievement. But then the glass half empty point of view is, oh boy, does this resource ever need to be rewritten. So that's going to be a theme sort of underlying it. We have a community, and some of the infrastructure needs to be refurbished, but by and large it continues to serve the original purpose. Here is a graph of, excuse me a moment. Well, OK, the Zoom windows at the bottom are occluding the bottom of the graph, so I'll have to do this sort of by memory. Oh, plus I don't seem to have a cursor to point with. OK, that's unfortunate in Keynote presentation mode. So this is a graph of arXiv submissions starting in the early 1990s. I just imported it into Keynote last night, so it's up to date as of yesterday. You can get a feeling just for the continuous growth that we've experienced. Let me amplify a couple of things that Atish said while I've got the slide up. It's a good reminder to go through the early history, which, it is fun to realize, is from before many of the people in the audience were even born, going back to 1991, when, as Atish mentioned, there were emails of papers being sent around in the string theory community. But long before that, before, believe it or not, the internet even existed, going back to the start of my own career as a graduate student in the late 1970s, people were passing around preprints in the paper version. And it was a system that was very unfair in many ways, in that people who were at elite universities, frequently in the US, had privileged access to the information.
When I was a faculty member at Harvard in the mid 1980s, I would have a list of people to whom I would send paper photocopies of my preprints. And I would receive them from other people, and I would make photocopies of them, or more often I would lend them to graduate students or postdocs, and they would make photocopies. We were aware that this was an unfair system. We didn't see a way around it. It had been in place before we started. When I speak to the earlier generation of physicists, for example Steve Weinberg, who sadly passed away recently, they tell me that they were doing this in the mimeograph era, starting at least in the late 1950s. So it had long been a tradition among high-energy physicists to share prepublication information, which is part of what is now called the open science movement, making it freely available. And the problem was, if you had some development in science, you'd like to think it was because you had more intuition or you worked harder, but not because you had privileged access to information. As a graduate student, I spent a year at CEN Saclay, just outside of Paris. It was the year 1979 to 1980. And I was confronted with this when I saw that the preprints coming from the United States (and this was a prominent institution in Western Europe, but even there) were sent, because of the expense, via boat mail over the Atlantic, and so were arriving three months after I would otherwise have seen them in the United States. The researchers there said to me, gee, a lot of times we don't even receive these preprints until the trend that they initiated is over. It wasn't completely obvious to me that that was necessarily negative, if you got to miss some of these things that fizzled out. But I could see what the idea was: if there was something that was going to take off, the rest of the world was just left out of it.
And so TeX didn't really begin until 1984, when something closer to the current version was released. Email didn't really begin either. There were a variety of different networks during the 1980s. There was the so-called DECnet (that was the Digital Equipment Corporation, which no longer exists) and IBM's BITNET. And then Harvard, where I was at the time, was connected to the internet in 1987. The internet is not (this is a common confusion among my students at Cornell) the World Wide Web. The World Wide Web is a set of services layered on top of the internet, which is the low-level protocol for transmitting packets around. So we had the internet for email and for file transfers for a number of years before the World Wide Web started. I mean, Tim Berners-Lee initiated it with the first proposal in 1989, but the software for it didn't really start until the 1992-1993 timeframe. So it was prior even to the World Wide Web that we started sending these preprints out via email. And it was still unfair, because there were still these privileged loops. Somebody made a funny comment in Aspen that summer, in June of 1991. It was Spenta Wadia, who said he was afraid to travel from India because all of these email messages would overrun his disk allocation, and he wouldn't be able to receive subsequent email messages until he got back and was able to get on his email. So that prompted it. I discussed it with a few people in Aspen that June, and we decided to set up a system. The original plan was just for people who were doing matrix models of two-dimensional gravity, these ways of using random matrices to simulate two-dimensional surfaces. And I remember the design was just to accommodate about 100 people. Well, a few hundred people; what I meant to say is that I was anticipating about 100 submissions per year, or about one every three days.
And so that's part of what this graph represents: from the day it turned on in mid-August 1991, it always received at least one submission, and the story after that was continuous growth. I want to tell one anecdote which involves the ICTP in an important way. The reason why it didn't go up immediately after Aspen in June is that I was traveling in Europe that summer, and part of my travel was to give some lectures on string theory and 2D gravity at ICTP in July of 1991. There were a few hundred people in the audience, including many students. I hadn't yet done my afternoon of software writing, but I already knew that it was going to come up. That was one of the few places I was lecturing that summer, so I passed around a list for everybody to sign up at that meeting in Trieste, and that was, I think, a fairly important addition to the original list of people who were notified of it when it turned on a month later. So I'm grateful to ICTP for that participation. I want to get on to other things, but because this early history takes place in a landscape so different from what all of you have experienced growing up in the digital world, I want to tell one more anecdote, which really encapsulates how new and strange many of these things were. As I mentioned, well, I didn't mention it, but the physics community was early to the web, because the World Wide Web had started at CERN for use in their experiments and then propagated to the US via the Stanford Linear Accelerator Center. Some of the original software for the World Wide Web was written by Tim for NeXTSTEP, and I was using a NeXT computer. That was the company started by Steve Jobs while he was exiled from Apple, and its operating system subsequently became the Mac OS X that everybody uses on current Macintosh computers, and of course iOS, but it was a fringe exercise at the time.
And I then ported it to the World Wide Web after about two years. The web wasn't being very heavily used yet, but we had an interface, anticipating that it would catch on. The American Physical Society in the 1994 timeframe was thinking about putting its own holdings online. They were creating an interface, they had asked me to serve as an advisor for this, and they wanted to know what software users should have to access their service. So of course I suggested what we were doing, which was a World Wide Web interface. This was before PDF, so we were transmitting PostScript files, and I said, just install a World Wide Web browser and access it that way. And the answer came back from another member of the committee, which was so unforgettable that it's etched in my memory: that was all well and good, but learning to install and use a World Wide Web browser was a very complicated and difficult exercise and couldn't possibly be expected of the average physicist. So they went with another interface entirely, which lasted about half a year to a year before they as well succumbed to the inevitable. Okay, I can't see the entirety of my slide, but this is an easier one to discuss. I've said most of this already: the email interface started in August 1991. Because I'm just as incapable of deleting files as I am of cleaning up the attic in my home, I never deleted any files, and so all of the data about accesses and everything else from the start of the system still exists somewhere. Every time I get a new computer I just put the entirety of the previous computer in some subdirectory, and so now there are about 20 levels of this hierarchy that exist somewhere, and I think it might even be backed up. The total number of documents, starting from that planned 100 submissions per year, has now grown; I checked last night to update this slide, and I wrote down there 1.95 million documents. The purview expanded from the original high energy physics to some combination of physics, math, biology, et cetera.
The current growth rate, I just checked: we're anticipating about 185,000 new submissions this year. I had done these projections a few years ago, and they came out right on the nose. That means we're estimated to hit more than 2 million at the end of 2021. It gets very close, and it'll be fun to watch it go through the threshold. We're expecting another 40,000 or so, perhaps 38,000, from now through the end of the year, and so it should hit sometime in mid to late December, which coincidentally is when we hit the 1 million mark, mid to late December in 2014. And the nature of this growth (it's not quite exponential, but the growth rate is increasing) is such that it took 23 and a half years, from August 1991 to the end of 2014, to hit 1 million, and just seven years since 2014 to hit this 2 million mark. Just to see what the distribution is and where the recent growth has been occurring, the plot at the left shows the submissions per year separated according to subject area, starting with the original high energy physics, and the graph at the right is the same data, just scaled so it gives percentages. Early on there was only high energy physics, and that's 100%, and the history since then has been, of course, the growth in other fields. The blue part at the bottom of the left hand graph is, due to the scale, not quite as flat as it looks, but it's relatively flat. The signal from that is simply that the high energy physics community by the late 1990s was already 100% on. We were already receiving on the order of 100% of the articles that were being written, so there just wasn't any room for growth; it saturated. Let's see, one other anecdote, again just because now it's institutional, but there was a point when it was heterodox. Michael Turner, at a meeting in the 1994 timeframe (you can see the red here is astrophysics), made a comment, and, you know, to give him the benefit of the doubt, he probably was not
entirely serious, but he said he doubted that this would ever catch on in astrophysics. It was natural that it would catch on in high energy physics, because of course these were people who were only interested in the work of the previous 50 nanoseconds or so, but it would never catch on in a serious field like astrophysics. And, you know, that has been the story: you can see the red starts growing, and they're still growing, partially, I think, because in astrophysics there really could be more articles being written. Mathematics started turning on in the late 1990s; you can see this magenta region there. And the recent story, and this has been a bigger explosion in growth than anything, has been this orange, and that's computer science. You can really see that in the right graph, where it scales and you can see the percentages. Computer science, which was a negligible percentage in the 2000 range, is now looking like about a third of the incoming. I mean, against computer science you have to count all fields of physics together, and so it's not bigger than that. And this has been very closely tied to the explosion of interest in machine learning, just as in physics. Another historical note: I remember bemoaning in the 1990s how we had just missed the high-Tc superconducting fad of the late 1980s, where things were propagated via illegible nth-order copies. These explosions of interest usually catalyze new people to join in, and once people join in they typically stay. We don't have any evidence, in other words, of a community which had adopted pre-publication dissemination of information and then later abandoned it. But, you know, we missed that one. There were magnesium boride superconductors in the late 1990s, and then in 2007 iron pnictide superconductors, and you can see each of them catalyzing interest and creating something of a step function. And that's what happened in CS in 2015-2016: it became the
place to stake priority claims. Once they joined, it then spreads to neighboring communities, just like it spread from high energy physics to the first closely neighboring community in math, which was algebraic geometry, and then continued from there. So, happily, I see I've got about 15 minutes. I can choose some topics and then leave the majority for the questions. I just wanted to point out in the historical context that some of this was once cutting edge. The arXiv was really the first to pioneer some of these things that look pretty obvious now, like using the abstract page as a hub: the abstract becomes the landing page, where you have links to the full content and to associated material. It wasn't obvious at the time, because the publishers back then were treating even the abstracts of articles as copyrighted proprietary material which couldn't be reproduced. Linking the author names, I don't know if you can appreciate it, is also something that we did first. I had an employee back then who had also worked on the primordial version of the Internet Movie Database, which was then later bought and commercialized. Then there is the fact that it was a skeletal structure all of whose content was provided by the users, and a cloud service, in the sense that we were storing everything elsewhere, and people usually found it more convenient to find their articles in the central repository than to find them on their own computer somewhere. A quick slide on some surprises. When it started, it predated all of these other resources I mention here: Google, Wikipedia, Facebook, Twitter. All of them are in some sense built on the power of crowdsourcing, this Web 2.0 idea of providing a skeletal framework and facilitating contributions. All of them were surprising to me. I argued that something like Google wouldn't be possible, because every query would return tens of thousands of items, and, you know, that's one of the many
things I was wrong about, because there were people who had better intuition for things. Wikipedia also struck me as something that was completely out of the question; it would be just covered with graffiti. And now I, like everyone else (you still need somebody to curate it), regularly refer the students in my class, and myself, to Wikipedia entries for quick overviews of information. The fact that we're still using TeX is a gigantic surprise. It's one of those things where Knuth just provided such an encyclopedic demolition of the problem of print on paper, but that's what it was intended for. It was never intended as a network transmission format, and we're seeing issues from it. It's very awkward to use: if you only have access to that source, you can't query it, and you have to turn it upside down just to get it to tell you who the author is or what the title is. We'd be much better off if we had the authoring tools that people like for equations producing high quality XML, but that's a transition that's still underway. So I think of TeX as like space shuttle technology from the 1980s, but the best we have. Scholarly publishing as a whole, spoiler alert, is still obviously in transition, and this is one that I've gotten wrong so many times. I remember stating already in the mid 90s that we were in some kind of metastable state which couldn't possibly persist, because we had the publishers, who were doing the quality control, which incidentally I regard as essential, and funding their operations, which again I regard as necessary. They provide a useful service, especially the professional societies, and they have to be compensated and be able to support those efforts. And they were operating side by side with the preprint server, which was providing the same information for free. And, you know, I thought this couldn't possibly persist for another few years. I
still assert I was correct, but my calculation of the decay constant was off by, at this point, at least a factor of 20. So I still believe there has to be some kind of breakout. I don't quite understand the way it will happen or when it will happen. I'll come to that in a moment; I won't have much time for it. But it's not going to be just in physics. We've been seeing forays into many other fields, and you can see a transition that's going to happen. I thought it would be a first order transition, and we'll just have to see what happens. So I want to spend just two or three slides on something technical involving how we compute on these things. There have been some technical developments, particularly in the machine learning realm, and one of them, which caught me by surprise and which I'll describe, is what we use as the text classifier. So there was, I'll come back to this in a moment. So there was, excuse me, a problem. Let me pose a problem to the audience, and I'll give you the answer. In the web interface we have people trying to submit to, for example, high energy experimental physics, but instead accidentally submitting to general relativity and quantum cosmology. Now, why would anyone who is intending to submit to high energy experimental physics accidentally choose instead, from the menu items on the selection, general relativity and quantum cosmology? Let me just pose this question to Atish, to put only one person on the spot. Why would that happen, Atish? So the question is: you're a high energy experimental physicist, you've just written your article about measuring the B meson lifetime and comparing it to epsilon prime over epsilon, you get on, and you submit to general relativity and quantum cosmology. Why in the world would you do that? I'm quite capable of doing such a thing without thinking, but why that specific confusion, general relativity? Okay, I chose the person who should have been most capable
of answering this question; this is even better than posing it to the audience. So the answer is, and yes, you are capable, we all are capable of doing this: you're in a pull down menu which is ordered alphabetically. You are quite certain that your mouse was on high energy physics, but there was this last minute mouse flutter when you pulled your hand off the mouse that instead routed it to general relativity, and you didn't notice. And so I was just curious, and I mention this because it's this mundane thing, but it actually turned into more. I thought, well, it should be kind of easy, and checked with computer science people on how you do text classification and all of that, and just harvested from all the text, and that's what this next slide is about. The first version of this just used the so-called naive Bayes methodology. And that case turned out to be incredibly easy to disambiguate, because high energy physics uses all of this vocabulary which is just specific to that domain, so that turned out to be the easiest one. I have a picture here of what's known as a language model, which is just the probability distribution of words, and this is a measure of the similarity between the word probability distributions for what was on the order of 140 arXiv categories at the time. The signal from this is how much structure there is. The colors towards the yellow and red indicate more similarity. This is something known as the Kullback-Leibler divergence; in physics it's just the cross entropy between the distributions. And you see these blocks. They're labeled alphabetically, so the upper left is the six astrophysics categories, and then the condensed matter categories, and so that's what that block diagonal structure is. If I zoom in, I can probably read it a little better, and, well, for me it's obscured by the Zoom pictures at the bottom, so I'll have to go by memory. I remember there's a
similarity between, you can see how they are more similar to the math topics than to high energy experimental physics, so that's something intuitive to people. Now, one of the really cool things, and I'm mindful of the time, but I do have to mention this because it has been critical, is that I then implemented this thing, and it's been operating behind the scenes, helping to assist the moderators in ensuring that things are appropriate to their categories. One of the issues, I should have mentioned, is that from the outset we've never been as naive as some of the large social media companies in refusing to acknowledge that there might be nefarious elements out there not acting in the interests of society and the system. I wrote about that as well in the mid 90s: what happens when the rest of the world discovers the internet, which of course has since happened, and are we going to have refutations of special relativity and quantum mechanics, along with grand unified theories and all the rest, from overeager high school students? So we've always needed some form of quality control. Everything passes, in principle, through this unforgiving 24-hour turnaround that's built into the system, and that had obvious holes. I decided that we needed automated services, because a computer can read every word of every full text and, unlike a person, never goes on vacation, never gets distracted, never gets busy, never gets sick, and so this was the one certain way of doing it. Oh my goodness, there are 12 messages in chat, but I can't look at them. So I would watch the classifier output and see that every once in a while it would spit something out as unclassifiable. And every time I looked at the things that were unclassifiable, I would notice that systematically it was stuff that was coming, let us say, from outside of the research community. And it struck me that here I was, just trying to solve this completely mundane problem of the mouse flutter causing
confusion between alphabetically adjacent entries, and completely unexpectedly I had actually created the holy grail crackpot filter. I had an effective mechanism, which has remained in place, embellished (it's now a deep learning exercise and all the rest), that can distinguish fairly reliably the science submissions from the non-science submissions. I can point you to this article, which gives a good description of this as of about five years ago, before it moved into the deep learning era. Another thing: there is SCIgen, which produces these fake papers. I won't say much about this at all because I'm out of time. It's hard to imagine why people would put their names on them, much less submit them to the arXiv; they're just computer generated nonsense articles. But they come in, and there's a fun methodology for detecting those as well. This is a screenshot of something I submitted to Nature, and it just shows what these fake articles look like. And it's clear that the two are separated: there's a decision boundary that distinguishes the yellow, the actual real articles, from the fake math articles and fake physics articles shown in this graph. Another thing that I was just going to touch on briefly, which was semi-contentious when it was first implemented, is the text overlap. You may have noticed that articles are occasionally flagged as overlapping with articles by the same authors or by other authors, sometimes even without attribution. And it works very efficiently. Again, you might find it remarkable, as I did, that it's possible to take a couple thousand new submissions every day and compare them to the entirety of two million previous articles. But it works by converting, in this case we decided to use seven-grams, that is seven-word sequences, into hashes in a database, which is 20 or 30 gigabytes for the two million articles.
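The seven-word-sequence hashing just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the arXiv's actual implementation: the tokenization, the use of SHA-1 truncated to 64 bits, the in-memory dictionary standing in for the database, and every function name below are hypothetical choices made for the example. The talk specifies only that seven-word sequences are converted into hashes and looked up against previous articles.

```python
import hashlib
import re
from collections import defaultdict

N = 7  # seven-word sequences, as stated in the talk


def ngram_hashes(text, n=N):
    """Return the set of hashes of all n-word sequences in text.

    Assumption: lowercase alphanumeric tokens and SHA-1 truncated to
    64 bits; the talk does not say which tokenizer or hash is used.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    hashes = set()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def build_index(corpus):
    """Map each n-gram hash to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for h in ngram_hashes(text):
            index[h].add(doc_id)
    return index


def overlap_counts(new_text, index):
    """Count shared n-grams between new_text and each indexed document."""
    counts = defaultdict(int)
    for h in ngram_hashes(new_text):
        for doc_id in index.get(h, ()):
            counts[doc_id] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical toy corpus standing in for the previous articles.
    corpus = {
        "old-1": "I wish to extend my thanks to all faculty members "
                 "of the department",
        "old-2": "we measure the lifetime of the particle in a new detector",
    }
    index = build_index(corpus)
    new = ("Above all I wish to extend my thanks to all faculty members "
           "of the physics department")
    print(overlap_counts(new, index))
```

Each new submission costs one dictionary lookup per seven-word window, independent of how many previous articles are indexed, which is presumably why comparing a day's submissions against the whole corpus is cheap once the hash table fits in memory.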
And that used to be big; it no longer is. Operating on a 120-gigabyte machine, it fits in RAM and works efficiently. And so you can compare comprehensively every article to every other article. And the 2000 articles take less than a second every day to do all of them against the previous ones. So that's useful, especially for flagging duplicates or user errors, where they've submitted an older article. It's amazing how many people, perhaps Ateesh has done this as well, put in the wrong TeX file when trying to submit a new paper. I shouldn't project on him, but he's the one face I see. And I just wanted, for fun, to point out one lesson for students. We've written articles about this, and I just want to make clear this is text overlap, not plagiarism, which is theft of ideas, but it's still useful. But I wanted to turn this into a minor pedagogic exercise for you in writing thesis acknowledgments. Here you see two different articles by two different people. It's up there in public, so I'm not exposing anything. You would think that writing your thesis acknowledgments is one of the unique places where you can be creative, but no, even there sometimes people just copy verbatim previous thesis acknowledgments: "I wish to extend my thanks to all faculty members of the department," "always been there whenever I needed support." So it's sort of like the Hallmark card method of thesis acknowledgments. "Above all, I thank the Almighty," etc. But then there's a risk that I want to point out to all of you, in another example, one of many of these, where they thank thesis advisors and all of the rest. You can see the earlier one says, "I cannot describe how indebted I am to my girlfriend, Amanda, whose love and encouragement," etc. Whereas the one on the right says, "I cannot describe how indebted I am to my wonderful wife, Renata, whose love and encouragement."
So this is the pedagogical message I want to send to all of you: if you do this, it's very important to change the girlfriend Amanda to the wife Renata, because if not, the consequences are potentially worse than just copying text. So, nominally, I had a few potential stopping points, and I inserted this slide as one of the earlier ones. I'm happy that I got to this slide in roughly the right time frame. It summarizes some of the things I said. As it says here, I gave a couple of examples of use of the full-text data. I haven't even discussed the usage data, which is extraordinarily rich as far as seeing what the trends in the community are, and I just left a bunch of open questions here. But since I'm already a few minutes over time, I think this is a good place to stop and instead entertain questions, so that I can focus on things that are specifically interesting to the audience.
OK, thank you so much, Professor Ginsparg. It was an interesting talk. And there are a number of questions I've seen, both in the chat and in the question and answer. So I will read them and I will try to moderate the questions. There are some technical ones, so I will start with one about comments on papers. The question is, do you see any reasonable way to generalize arXiv's functionality to be used for comments on papers?
Thank you for the question. This is something else that was suggested and looked like a natural thing to do 25 years ago. We considered it at the time and explicitly decided against it, and I actually haven't had reason to revisit that decision. The reason we decided against it at the time was that the methodology was to be entirely automated. We're doing all of this, the two million papers, with this completely skeletal staff. Let me backtrack slightly: at the original 100 per year, I was anticipating maybe one human intervention per month.
And that would be consistent with my not having to worry about it. Having scaled up by a factor of 2000 from that, multiplying by one out of 30, that's dozens of interventions per day. And it's actually more than that, because the nature has changed. But nonetheless, it's handleable with automated software and with a skeletal staff. By contrast, the concern was that comment threads would have to be much more diligently moderated, and that we would need additional staff focusing just on that. And instead, it was preferable to make this a distributed effort. The interface is completely open, in the sense that if you look at the URL design, which is one of the things that I designed in when it first went to the web in late 1992, it's just abs slash the identifier, pdf slash the identifier. So any external service could provide a comment interface and it wouldn't break. I claim, without direct evidence, but nobody has ever given a counterexample, the greatest success of any website in the history of the web in never having broken a single URL. The URL scheme that was in place in 1993 is the same one in place now, and none of them have changed. None of them are these inscrutable hex hashes with cookies and all of the rest. And so I thought that it would be much more auspicious. And we encouraged that via blogs and via these comment services, which take on the effort of keeping these things civil. And the decision not to do it has actually been reinforced by all of the trolling that you see in newspaper comment fora and all the rest. I'm not saying it's an unsolvable problem. It's just so much more effort than it has taken to produce what was this very clean methodology. And so I don't expect comments to be layered on, but we continue to actively encourage anybody who wants to build a comment interface. And let me mention one other thing.
There was also a signal that came very clearly from young researchers who didn't want comments on their articles, who were afraid that some prominent physicist would dismiss an article and that this could have negative career effects. And that highlights a very essential issue: you have to be much more careful when you're the primary distribution mode that everybody is looking at. If there is some heterogeneity, and people are looking at different comment fora with different types of comments and different thrusts, then it's one thing. But we really ran the risk of having everybody who accessed the paper first confronted with some very negative comments and not looking at it. And then there were the people who said they didn't want comments because then they would feel compelled to keep checking their article every day, to make sure nobody had commented negatively in a way that they would have to respond to. And so it wasn't even clear that the community at large wanted it, even though we all benefit from comment fora when we see them. It just made sense to keep it at a physical and logical remove from the main distribution site, because the risks of getting things wrong were too high.
That's great. There was a question about formats, and so they're asking, can you share your vision about the future of LaTeX, and to which formats should scientists move, if any, in the future?
Yeah, I did make some comments about that. I personally am not a LaTeX user. I started with TeX before LaTeX, and so I use just plain TeX, but it has the same deficiencies. If you talk to mathematicians, they think that LaTeX will still be the format 100 years from now. To me, that's completely unimaginable.
But it's another one of these surprises. We shifted en masse to TeX in the 1984 time frame because it had every advantage over what preceded it, where incidentally what preceded it was taking your handwritten manuscript and having what was then known as a secretary, now an administrative assistant, type it for you. And remember, they didn't understand what it was they were typing, and they had an IBM Selectric typewriter with exchangeable balls to do all of the Greek fonts and everything else. And you had to bribe your way to be high up in the queue to get your article typed before all of the other articles waiting for them. And we weren't permitted to charge the chocolate bribes to the grant. So when TeX came along, suddenly we could produce it ourselves. This was also the era, by the way, when cut and paste was not a metaphor. It really meant they had to take scissors and cut the paper and glue parts of it onto other parts so that they didn't have to retype it. So it had every advantage, and we switched over amazingly quickly. Whether it would actually save time for us is, of course, an interesting question. You're always transferring work from support staff to the researcher. But at least we could be more satisfied with the output. And it had this additional spectacular advantage, which we didn't have before, that we could send emails back and forth with equations in them instead of just describing them. TeX was a transparent enough ASCII format that it was amenable to discussion. And so it had all these advantages. On the other hand, it's not a semantic format, and that's such an incredible disadvantage. As Ateesh mentioned, I was on the advisory board for the NIH PubMed Central repository, where they had this high-quality information feed and they were getting it in a modern XML format.
And there was so much data mining and linkage and other things that they could do, which are simply impossible in PDF. There are attempts to convert LaTeX into HTML, but it's always a kludge. It's always awkward. It's always difficult. And in many ways, LaTeX has been like a ball and chain. We started out ahead of these other communities, but they now have an advantage because they're using modern semantic formats and we're not. On the other hand, we don't have the authoring tools. We're used to LaTeX. I was absolutely stunned with my high school children: there are these online resources, I'm momentarily blanking on the names of them, and some of you probably know, for high school students, with math problems online and all of the rest, and I saw that my high school children were being taught to use LaTeX format for transmitting formulas back and forth. So again, it's a testament to Knuth that he designed something so brilliant that it has outlasted multiple computer languages, multiple protocols, and all of the rest, and it will be very difficult to replace. But I don't know when, and I fervently hope and expect that we will make a transition from it.
Super. And then we have two related questions, I would say. The first one is, why didn't the arXiv, over the initial 10 or 15 years, expand to other vastly different fields? Now that it has grown into bioRxiv and other clones, why were they not all centralized under the arXiv? This is one. And the other one has to do with the high energy physics community, which avoided high publication fees. The person thinks that the arXiv has played a large role in that, and asks, have any other science fields tried to imitate the arXiv? And why are other scientific fields willing to pay large publication fees?
So there are lots of elements of these questions that I will be unable to answer. But I'll give a stab as to what I know.
So the first part of the question was, why wasn't there more rapid expansion into other fields, say, in the 1990s? And I was in contact specifically with people in biology at the time who were asking this question. And it was very prominent. The PubMed Central that I mentioned was actually a consequence of a talk that I gave. I was invited to give a talk at Cold Spring Harbor Lab in the 1998 timeframe, explaining what I was doing with physicists and asking whether biologists were going to be interested. And many times during this period, I would write these semi-snide remarks about how it's so nice to welcome the biological community to the late 20th century, better late than never. And it still didn't catch on. And I can't tell you all of the reasons. One of the ones I kept hearing always struck me as completely ridiculous, which was that they had this fear that if they posted it, then they would be scooped, because they had built into their mentality that it was possible to give a talk at a conference and have somebody in the audience go back to their lab and reproduce it. And if the other person published it first, then they would get complete credit, because the only thing that mattered was the publication date. And this is completely nonsensical to us. When I was a graduate student, one of the prominent examples was that the attribution for asymptotic freedom was always Gross and Wilczek, Politzer, and Gerard 't Hooft, unpublished remarks at the 1972 Marseille conference. And that actually was such a serious obstacle that, as far as anybody can tell, it delayed the Nobel Prize for asymptotic freedom until they could somehow get it down below four people, due to these unpublished remarks. And of course, 't Hooft and Veltman got a Nobel Prize for the weak interaction and technical methods. And that then opened it up for the other three, after politely waiting a few years so it didn't look too unseemly.
And so, for us, it didn't make sense, because one of the major uses of arXiv was to stake a priority claim. Once it was out there, everybody agreed. And that was one of the reasons it was so successful. But I couldn't convince them otherwise. And just as an indication, originally PubMed Central was supposed to have a preprint sector. And just to tell you what the obstacle was, in the early 2000 timeframe the NIH, really the National Library of Medicine, really, really wanted PNAS in order to make this thing successful. They needed PNAS to voluntarily deposit its content in this open access mode. And the then editor-in-chief of PNAS, that's the Proceedings of the US National Academy of Sciences, Nick Cozzarelli, was also on this advisory board, and we had the meeting, and he just stated flat out: we will not deposit our content if it is side by side with a preprint repository, in any way, shape or form. So there was just this explicit antagonism towards the idea that I can't quite explain. And all of these people I'm mentioning are even more fossilized than I am currently, and their minds just could not be changed. But all of you are growing up in this environment of digital sharing, where everybody shares their lives on Twitter and Facebook and YouTube, and you just can't imagine this mentality that somehow the science they do has to be closed off. But that never penetrated. And other fields were too dominated by the journal hierarchy. They were told, if you send it out as a preprint, that counts as publication and therefore you can't publish, and they were just too dominated by the editors of journals and were afraid of career or reputational risks.
And that's why I mentioned that it wasn't revolutionary for our field in any sense. It was just a small switch from the existing paper preprint distribution to doing it online, making it fairer and more efficient. And the best I can do is say, you could have asked the same question, or I could have, as to why none of these other fields did it in paper form. And whatever that obstacle was, it continued through at least the early electronic era. But I want to finish on a more positive note. There is of course, and this is part of the second question, bioRxiv and medRxiv. I'm on their advisory board as well. I got a polite query from the people at Cold Spring Harbor who were setting this up in 2013, when they asked me very deferentially: we think that biologists just need more handholding, and we want to set up a separate service which is better optimized for them. Do you have any objection to this? And my answer was, look, it's a free country. You can do whatever you want. If you can do it better than me, then go ahead; this is a community that's proven remarkably resistant to doing the right thing. If you can encourage them where I can't, then that would be absolutely fantastic. And so two weeks later, they sent me a message: okay, we're going to do it. Would you like to be on the advisory board? And I said, okay, fine. And it has been moving. I'm very happy that they're doing it, because they do things that we cannot possibly do. And I'm not convinced that they scale. They actually have humans going over each one. And they were actively moderating and rejecting, say, COVID-19 articles over the past two years that could have been public health hazards and menaces. And we just would not have had the labor to do it. And they're operating at a much smaller volume than us. And I just find it very, very useful to have competing resources in the ecosystem, because there is the risk.
I'm not exaggerating when I talk about how the software is sort of stuck in place. I was a young researcher when this started. I'm now on the brink of retirement, and there needs to be an infusion of new ideas. I did it because I had an intuition for the way people were working. It's questionable whether I still do, since I'm not a Facebook user, I'm not a Twitter user, and I don't even have a smartphone, because I don't like all the people who spend all of their time tied to their ubiquitous mobile device as though by an umbilical cord. And so when I'm on a bicycle ride, I'm off the grid and I'm quite happy, thank you very much. So it needs new people, and it's great if these other things spring up. I think the ultimate resolution that you're actually looking for is, why aren't these services more interoperable? You don't really care if they're run by different people with different sections elsewhere. I don't mean "you" aggressively, but you and I and all of us would just like to go to one unified interface and find everything we want. And so that's where the real effort should be: federating these different resources so they interoperate seamlessly and you find what you want and how to submit to them. You could even imagine, at the level of submissions, these things being automatically routed, and even, with the right system, automatically routing a certain number of these submissions to viXra, which is arXiv spelled backwards. On the final question, I have nothing to say about publication fees, except that sometimes these things can be misleading. In high energy physics, we didn't have publication fees in the Elsevier journals in the 1980s, and they became much bigger than the American Physical Society journals, which had page fees. And what we were not aware of was how clever this methodology was, because in fact we were paying much more, even though they seemed to be free.
We were paying 10 or more times more via indirect costs on our grants, channeled to the library paying for subscriptions. And so the fact that the costs were indirect rather than direct, and that we felt no direct impact from them, caused a disequilibrium of the system. But as to what the publication fees are, what they should be, and why different methodologies are used, it just goes back to my comment that we agree that quality control is necessary. I don't think arXiv could survive as well without this invisible hand of somebody doing that work to organize the curation of the literature, independent of arXiv, and it has to be funded somehow. And with the open access, open science push, if you're not allowed to have subscription income, then the only place it can come from is author charges. And the net effect of this is that you learn part of what was potentially very clever about the subscription system, which was that you were distributing the costs over a much larger number of libraries who were serving readers. Whereas if you concentrate the costs on the people who are producing it, you're concentrating those costs on a much smaller number of entities, and so they pay more, given that there are these fixed costs. It's good to clean out the system so that rampant profits, unnecessary profits, are no longer draining it. But even after you clean that up, nonprofits are still going to have costs to support the infrastructure, and the money has to come from somewhere.
Great, so we have one question about diversity and inclusion. These days there's a lot of discussion about diversity and inclusion in science. Do you know if there has been any particular analysis looking at the arXiv's role in supporting this dialogue? And the second question is wondering if anyone has used data from the arXiv to study biases in science.
I would love to be able to answer that question.
The answer is I haven't, and I don't know anybody who has, and I'm not sure where I would start, which is not at all to say that it shouldn't be done. I think it would be an incredible thing to do, but you would have to do a lot of what we would call ethnographic work on the side, supplementing the data on the arXiv, which of course is intentionally not trying to highlight the nature of the diversity of the people who are either submitting the information or accessing the information. Let me tell another quick anecdote going back to the preprint era. In the early 1990s, somebody from India mentioned to me that they were much happier with it, because before, they were sending preprints out on lower quality paper stock. It wasn't quite like a paper bag, but it was close. And so they had the impression that all the preprints would appear on shelves, and the ones on eight and a half by 11 white paper would get the attention, and the ones that were smaller and printed on lower quality paper were just being ignored. And the comment was that this is so much more of a level playing field, where everything appears in cold ASCII and we're no longer discriminated against on that basis. So there are anecdotal aspects of that. And everything about the interface is intended that way, but you can't hide the author names. There are still potential biases, since people can still see institutions and authors and things like that. And we have had calls, especially from the CS community, saying, we submit our articles to conferences with double-blind refereeing, so we can't submit to arXiv because that would violate the anonymity that we treasure. And to that we just said, okay, that's fine. The answer is you don't submit to arXiv until you've been accepted or rejected by the conference.
That's an easy one, and sorry, you're not able to use it to stake a priority claim, but you're not permitted to stake an anonymous priority claim anyway. If the point is to be able to build on the research, people want to know whose research they're building on. So we're always going to have the names, and there's potentially still an implicit bias in there. The question might be, is there a question of bias among our moderators against, again, certain countries? And we have looked at the data on that, and we don't find anything. You could ask, should we have some external agency making sure that moderators aren't systematically trying to bounce, as non-science, articles from less developed countries? My own impression, and this is not based on a systematic survey but on spot checks, is just the opposite: the moderators have been much more forgiving, in the spirit of diversity and inclusion. And you even see that explicitly as a comment: oh, this may be naive, but we give them a break, it's not harmful, and they deserve the dissemination as much as everybody else, and to be up there along with the rest. But I don't know of systematic studies that were done. They would take somebody very serious, and they probably should be done by a sociologist or historian of science rather than by researchers. And I say this because the question started with there being a lot of discussion of diversity and inclusion, and absolutely, we have that all the time in the workplaces here. I've always been a firm believer in it, and I very much hope that such biases do not exist inadvertently in arXiv, but that's what everybody thinks, and you need somebody to check and make sure.
I think Ateesh had, okay, Ateesh.
No, unless there are some other questions. Marco, maybe you should make sure that there are no other questions before.
There was a comment about Hossenfelder. Yeah, there are plenty of other questions. I would go for one about the arXiv itself from our colleague Marcello, saying that arXiv's evolution is full of success stories. Is there an experiment that went wrong, and that has been instrumental in facing the challenges that arXiv faced?
Yeah. Yes. And we're facing it right now. And it's what I've been dealing with for the past two days in this external advisory board meeting, so the ins and outs are very much on my mind. The experiment that went wrong was embedding it in the library. I didn't say this, because I was trying to get through it quickly, but it started when I was at Los Alamos and operated for the first 10 years, from 1991 through 2001, at the Los Alamos National Lab. And then in 2001, I moved from Los Alamos to Cornell. Incidentally, this move was for other reasons. It didn't have to do with arXiv. It had to do, first of all, with the work environment at Los Alamos having degraded due to spy scandals that had nothing to do with any of us but led to new restrictions on the work environment, together with the birth of my daughter in the year 2000. I was going to have to move anyway from this beautiful area in the middle of nowhere where I was living in New Mexico, to be closer to childcare and school. So it made sense to look around, and Cornell, where I had been a graduate student, was a natural place to come back to, it turned out. And I was thinking at the time, and this was an ambitious goal, what are the institutions that are stable on the timeframe of centuries?
And it's not the companies. Google wasn't really even on the radar screen then, believe it or not, but the IBMs, the Bell Labs and all of them, or the big oil companies or the railroads, they tend to last on the many-decades timeframe but not longer than that. On the other hand, the entities that tend to last for centuries are the professional societies and university libraries. I mean, the university library at Oxford has been there since the 13th century or something, and the Cornell University Library did exist a century ago and will exist a century from now, and the professional societies are also in it for the long run. And it was an experiment, embedding it within the library on the basis that this is what libraries do: they disseminate information to people. And it seemed early on like a natural fit, but it became clear within a few years, by 2005, after about five or six years, that it was just an awkward fit for a number of reasons. Number one, the library said, yeah, we run software, and they were interpreting it as some kind of shrink-wrapped software that they would get from a vendor and then just be able to install and turn on. And this may not come as a shock to any of you, but I am not a programmer. I can write highly functional code, but I am not a programmer in the sense of having learned programming. My first software language was Fortran, when I was in high school, and I have never recovered from that experience, where the code was go-tos and real programmers did not comment their code. So most of my code is inscrutable even to me a week after I've written it. And so the library was not set up to manage a development team, and as well, through no fault of theirs, it wasn't a stable landscape. The entire landscape in which it was operating was shifting every few years.
Many of these resources I mentioned didn't exist in 2001, and so you were constantly having to adapt to a landscape being redefined from underneath, and that was not what the library was set up to do. And so we had to keep supplementing; they were never able to take it over entirely. And then the other awkwardness of the fit was that, yes, libraries disseminate information, but they disseminate information curated by third parties to their internal community. And it quickly became clear that arXiv's purview was very different from that: the majority of the users were obviously outside that community, and the information had to be curated internally, a lot of it materials of entirely uncertain provenance. And so they weren't set up to do the quality control, and there were questions of why it should be funded by the library when the funding coming from the library was intended for the benefit of the internal university. And so they had to come up with a new funding model, in the 2008, 2009 timeframe I think, where there were members and they made voluntary contributions, and it's not clear that's long-term sustainable, since they're paying for things that would be freely accessible anyway. And so eventually the memorandum of understanding that I had with the library, that they would continue supporting it and keep it available as a freely available resource, was, not surprisingly, only as good as the paper it was written on, for the duration of the then-current university librarian. It was half abrogated when the next university librarian came in, and then fully abrogated in the 2017, 2018 timeframe, when they brought in a new university librarian who said it's just too much of a nuisance, too high visibility. And it was moved out of the library in 2019 to the computing and information science directorate, but that has actually been a series of awkward messes, with the dean having moved to a different job at Cornell Tech.
I don't want to go into all of the internal stuff, but it has just left it in a very uncertain state. I mean, I'm not concerned about the long term, because there's nothing technically difficult about this, but my inclination is that it has got to move forward into more of a partnership with that other entity I mentioned, the professional societies, who ultimately have experience much better aligned with ingesting material and disseminating it throughout the world. And so that was an error I made. It wasn't an obvious error at the time, and it was only exposed over time. The reason I made that error was that I was concerned that the professional societies had a conflict of interest, and they did, because many of them had their own publishing enterprises, and they would not put the resources or attention into arXiv in preference to their own resources. Twenty years later, it's a different ball game. It's much more powerful, they recognize it, the community is much more empowered, and I think that we could design it. Plus, I think they're operating in good faith in the discussions we've had. These things won't happen immediately. It's something that will play out over the next five to 10 years, and I'm hoping eventually this error can be corrected, to the benefit of the community. I do know that I will be retired in that timeframe, so it won't be up to me. Plus I won't care. I'll care, I'm sorry, actually.
Okay, Marco, maybe. Yes please, I'll just go ahead. First of all, Paul, thank you very much for a very interesting talk. For the ICTP community it was great to hear from the pioneer of the arXiv. I actually have a number of questions.
It's my first Zoom colloquium, so I apologize for going on this much.
No, no, I think it's fine. In fact, I have many questions, and I hope you will visit us some day so that we can discuss it more at length. Maybe there is one quick question.
Nothing would make me happier than to be able to get on a plane again.
My last plane trip was to fly to Denver for the APS March meeting and to find out on the train from the airport to downtown that the meeting had been canceled. So I have actually one quick question. I mean, I know that it's probably a topic for a whole other colloquium, so maybe you can just comment on it for two minutes if you can. Also because I see that Professor... I'm sorry, I can't. Okay, but okay, I will just ask one quick one. I see that [name unclear] from SciPost is here, and I think some of my SISSA colleagues have dealt with JHEP, and this is probably a big question about what the difficulties are. One reason why journals continue to be important is refereeing and community validation and their use in promotions and so on. And I'm sure this question has been discussed before and people have been thinking of alternatives. So do you want to comment briefly on what the difficulties are and in what way arXiv can be... No, actually, I can do this in under two minutes. I don't have a lot to say. I mean, SciPost, all of these things fall under what I said about how important it is to bring new blood into thinking about this and to diversify the ecosystem. If there's one monolithic thing that controls everything, it's ultimately an impediment to progress. And so my only comment on that is I am absolutely enthusiastic about the experimentation, and I personally, you know, am open to trying to coordinate in any way possible, because, as I said, there are just incredibly important unanswered questions on what's the right way to curate the literature and the right financial model for supporting it. And, you know, I think these are attempts in that direction to clarify it. And I don't trust, you know, obviously, the existing commercial players to do the right thing, because they have too much at stake financially.
And all of these, you know, organic, from-the-bottom-up sort of resources are exactly along the lines of what I did with arXiv and, you know, what I regard as the most auspicious for the future. So my only comment would be: the more, the better. Okay, great. Thank you very much. So, Marco should be... so Paul, I will join you very briefly with the students, and then you are left in the company of the students to defend yourself all by yourself. So many questions, so few answers. So thank you very much, Paul, again. Thank you very much. It was great to have you, even by Zoom. And I look forward to seeing you in person. Yes, we can discuss that. Thank you. Thank you. So now I have to find the next Zoom link. Yes. Okay, I think I have to. Maybe if you want to take, should we? Yeah, maybe you can take a short break for a couple of minutes and then try in a minute. Okay. Thank you.