Welcome, everyone. Welcome to San Antonio, and welcome to the spring 2013 CNI member meeting. I'm delighted you've all made it here; I hope the trip has mostly been easy and not too disrupted by thunderstorms. I think you're in for a really interesting day and a half of sessions. I'm Cliff Lynch, the director of CNI. Let me say, before I forget, to our new member representatives that I didn't have a chance to meet at the introduction for new members today: sorry to miss you, and please, if you have an opportunity, introduce yourself at some point over the course of the meeting and say hello. The reason I wasn't able to be at the member orientation was that we did a double dose of executive roundtables this morning, scholarly identity being a very hot topic, it turns out.

I'd like to take a moment to welcome our international visitors. International travel these days is not as easy as it used to be, and I appreciate all of you joining us; we have a good representation of colleagues from outside the United States here with us. I'd also like to take a moment to recognize four institutions who have joined or rejoined CNI since our last member meeting: the University of Nevada at Reno, Montana State University, the University of Alberta, and the Johns Hopkins University Press. Welcome all.

I will just remind you of a couple of things. We have wireless; if you need any help with that, check with the registration desk. There is a message board out there; if we have any schedule changes, and thus far at least we have not had any, we will post them there. The meeting rooms are on two levels: the level you're on, and then, on the side of the hotel, immediately on the ground level, not the Riverwalk level but the ground level where taxis drop you off, immediately underneath us. So we are split across two floors, and there is a map of the meeting rooms in the meeting program if you have any trouble finding things.

With that said for introduction, I am going to move immediately on to our speaker. We have a very special plenary talk for you today, and I'm just so pleased that Herbert was willing to do considerable violence to his personal schedule to join us here to do this, because I think there's a lot of important stuff in this talk and a tremendous amount of food for thought for all of us.
Herbert Van de Sompel, I know, is well known to all of you. He has had an amazingly distinguished career; he has done all kinds of things, and you'll find his name associated with a substantial number of the most important projects in networked information over the last 10 or 15 years. You can read his biography on the web, so I'm not going to take you through it. I will say, on a personal note, that many years ago I was very privileged to be part of a special committee that was put together to review his PhD thesis. That was a wonderful moment and a special treat, and of course I had followed his work even before then and have had the privilege of working with him on a few of these projects, at least in a very modest kind of advisory role. So I'm going to turn the podium over right now to Herbert, who is going to give you a very high-level and, I think, very thought-provoking reflection on a lot of the developments we've seen over the past decade and where they are likely to take us. Welcome.

Thank you, Cliff, for your kind words. Looking at the size and caliber of the audience, I'm now thinking I should have stayed on my road trip in Utah and Arizona. I'm going to start with two apologies, actually. The first one is to those of you who were at the IDCC meeting in Amsterdam a couple of weeks ago, because there is going to be significant overlap with the presentation I did there. The second apology is about the fact that this talk will be a bit more technical than you would expect of a regular plenary talk, but I guess Cliff was aware of that when he invited me, so here we are.

There is something magical about CNI meetings in San Antonio, Texas for me, because about 13 years ago I did my first plenary talk here, in a hotel not too far from here, actually. In that talk I presented the protocol for metadata harvesting, which I'll touch upon in a bit; it had just been released, or was about to be released officially. In the second part of that talk I looked at the new capabilities that the digital networked environment was offering us to transform scholarly communication, to implement the core functions of scholarly communication in a different way. I'm talking about registration, certification, awareness, archiving, and rewarding, and how all of these things, or a lot of these things, had been implemented in a vertically integrated way in the journal system, and how the digital environment was allowing us to rethink how they were implemented and to start thinking about implementing them in a more modular kind of manner. None of this has really happened, but here we are; we are starting to see some signs out there, especially with less traditional materials, of more modular implementations of the various functions of scientific communication.

Call me lazy: it took me four years to write down the ideas that I had presented in that plenary address, in a paper that got published in D-Lib Magazine. I wrote it with a couple of friends from Cornell University, including Sandy Payette, Carl Lagoze, and Simeon Warner. Again, this paper looked at fundamental changes that were happening in scholarly communication, this ability to implement the functions of scholarly communication in a modular kind of way, but it also observed something that was changing about the nature of the assets being communicated in the scholarly system. It said that we are starting to move away from communicating only by means of PDFs or monographs and are starting to introduce a whole range of less traditional assets in eScience and humanities
endeavors. I'm obviously talking about data sets, software, ontologies, workflows, slides, blogs, and what have you not, and my talk today will really be about these kinds of materials and two core characteristics that distinguish them from the kinds of materials we were used to in the traditional scholarly communication system. On the one hand, these materials are not atomic anymore, like a scholarly article was; rather, they are all kinds of different components that really belong together, different parts that all relate to the same research endeavor and that one way or another need to be bundled up. That leads us to the notion of grouping assets that have a wide range of relationships and dependencies on each other. The other core characteristic is that, unlike PDFs, which had a very strong sense of fixity (journal articles have a very strong sense of fixity), these materials don't really have that; they are continuously in motion, or at least for part of their lifetime are in motion, and this leads us to the notion of versioning assets and how to go about doing that. So I'll talk about two technologies that I was involved in devising that try to address some of the challenges related to these two characteristics. One is ORE, Object Reuse and Exchange, which is about grouping assets, and the other is Memento, which is about versioning assets.

But before I go there, as promised at the outset, I'm going to go back to the protocol for metadata harvesting, and the reason I'm doing that is to point out the fundamental change that our perception of the web infrastructure has undergone in the past 15 years. First of all, the Open Archives Initiative, if you remember, was a heroic effort to try and fundamentally change scholarly communication. We basically said, let's start communicating by means of preprints, by means of non-reviewed literature, and in doing so basically subvert the existing scholarly communication system of established commercial journals. Again, that didn't happen, but out of this effort at least came a couple of interesting technologies. We looked at this problem domain from a technical perspective and said: in order to make this dream of communicating by preprints happen, we are going to have to make them easier to discover, easier to use and reuse.

Now, the way we went about achieving this goal is actually utterly interesting. First of all, we looked at this as a problem of metadata exchange. You need to remember this is 1998, when we started to tackle that problem, and in those days a lot of digital libraries held metadata but not necessarily the content described by the metadata; we were still in the phase of uploading digital assets, digitizing existing materials, and so on. The other consideration was that most of the search engines that existed in those days were mainly focused on indexing HTML materials and not necessarily PDFs, hence this notion of, okay, let's focus on the exchange of metadata. The second component that is interesting to observe: if one were to tackle the problem of exposing and making discoverable preprints today, you would obviously come at it from the perspective of search engine optimization; you would go out of your way to make sure that Google and the like would easily find your materials. Well, we did not do so. Rather, we had this whole notion of data providers, at the right-hand side, that would expose their metadata, and then service providers would just pop out of nowhere and start to create all kinds
of interesting services based on the exposed metadata. What is really going on here is that, at the moment in time we devised the protocol, we did not consider search engines to be an integral part of the web infrastructure; we just did not look at them that way, which is very different from the situation today.

The third aspect, and from my perspective one of the most interesting ones, is purely technical. Again, this is 1998, and obviously HTTP exists and the web is starting to bloom, but what exactly HTTP means and what exactly the web infrastructure means is not yet very clear to us. At that moment Roy Fielding had not yet published his thesis about the REST principles, and the web architecture would only be defined in 2004, so that was like six years later than when we started this effort. So these notions that are now core to the web architecture, resources, URIs that identify resources, representations of resources, were not known to us yet. Here we are, with a fuzzy understanding of what the web really means, devising this protocol. Just look at the first part there (there's something wrong with the slide here, I'm kind of losing my background): the first part basically says we're defining a protocol in the abstract, and then we're going to tell you how to instantiate it on HTTP. What this really says is, we're not trusting HTTP. Remember, we had just lost Gopher, and we were recovering from that, and we didn't trust that HTTP was going to stay alive for five or ten years. So we defined the protocol in the abstract and said, okay, let's now show people how to do that in HTTP. The second one shows a request for a metadata record, and you see the verb in there, verb equals GetRecord, so "get me this thing." Well, today you would do that simply by an HTTP GET on the metadata record; you wouldn't put a GetRecord verb in there. Same thing with the notion of an incomplete list. In the protocol for metadata harvesting you have the notion that a server can send back a list that is too big for the client to consume in one go, so what do we do? We introduce a paging mechanism so that the client can get this large list piecemeal. The way that is implemented is by means of a resumption token: the server sends a token to the client, and in the follow-up request the client resubmits that token and the server gives back the next part of the incomplete list. Clearly something one would implement today by means of a link, as is done in the Atom protocol.
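As a rough sketch of that contrast at the HTTP level: the verb, identifier, metadataPrefix, and resumptionToken parameters are the real OAI-PMH ones, while the repository base URL and the resource-style "how you would do it today" variant are hypothetical.

```python
import requests

BASE = "https://repository.example.org"  # hypothetical repository

# OAI-PMH style (1999): the operation is named in a 'verb' parameter
# and tunneled through a single endpoint.
record = requests.get(f"{BASE}/oai", params={
    "verb": "GetRecord",
    "identifier": "oai:repository.example.org:1234",
    "metadataPrefix": "oai_dc",
})

# Paging, OAI-PMH style: the server hands back a resumptionToken that the
# client resubmits to fetch the next chunk of an incomplete list.
first_page = requests.get(f"{BASE}/oai", params={"verb": "ListRecords",
                                                 "metadataPrefix": "oai_dc"})
token = "..."  # extracted from the <resumptionToken> element in the response
next_page = requests.get(f"{BASE}/oai", params={"verb": "ListRecords",
                                                "resumptionToken": token})

# Web-architecture style (today): every record is a resource with its own URI,
# so "get me this thing" is just an HTTP GET on that URI ...
record_today = requests.get(f"{BASE}/records/1234")

# ... and paging is expressed as links, Atom-style: follow rel="next"
# until the server stops offering one.
resp = requests.get(f"{BASE}/records")
while "next" in resp.links:
    resp = requests.get(resp.links["next"]["url"])
```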
All of this is to say that in those days the web infrastructure was not very well understood; we didn't really know what was going on, and when tackling interoperability problems what we really did was piggyback on the web infrastructure. We said, okay, of course we're going to use HTTP because everyone is using it, but we were still defining interoperability as something between those two parties up there, leaving all the rest of the parties on the web out of the interoperability game. So the actual radius of interoperability defined in this kind of way was relatively small.

Let me get back to those two technologies that I mentioned, which address the challenges related to the emerging compoundness of scholarly assets and, later on, the versioning of assets. The first one is ORE, and the consideration there is that scholarly assets are no longer atomic but are becoming compound and consist of multiple resources with a variety of relationships and interdependencies. The question that ORE tries to answer is: how are we going to convey this compoundness in a machine-actionable kind of manner? How are we going to say these resources are part of this compound object and those out there are not?

In order to motivate the problem it's actually enough to look at a very simple example, which is a splash page, in this case from the physics arXiv. What you see going on here is already this compoundness. It starts off with the splash page, and the splash page has its own identifier, but that's not the identifier of the assets themselves, because those are the PostScripts, PDFs, etc. that are listed here. There are several versions available of this thing, several identifiers pertaining to this thing, and relationships with all kinds of objects. You look at this as a human and of course you understand what's going on; we can interpret this stuff. A machine cannot. So ORE was about how we are going to express to a machine that this part here is all part of the object, and this is actually not part of the object, these are just relationships between the object and other kinds of things that are outside of the object.

We started this endeavor, and we are now talking 2006, in a way that is extremely similar to how we approached the PMH problem, so I hope you recognize the similarity with the picture I showed you earlier. You have here, let's say, the repositories that are the data providers, and here you have a service provider, and these things here are the compound objects. Then we would have interfaces in front of these repositories: not only a harvest interface but also an obtain, which is an HTTP GET of course, and a put, which is an HTTP PUT. So we looked at the problem again in a repository-centric kind of way. Fast-forward the clock less than a year and suddenly this entire perspective has fundamentally changed. Rather than looking at this problem of compoundness from the perspective of repositories or digital libraries, we're looking at it from the perspective of the web graph. Suddenly we say, well, in the end all these resources that make up a compound object reside on the web, and the problem we really need to tackle is how to express that these resources here belong together, and that those resources belong together, and that these ones belong together. Basically, how do we add information to the web graph that says that these things are connected, related, and belong together? The other thing that's going on here, and you see that in the title, is that we now start to really embrace the fact that web crawlers, and hence search engines, are an integral part of the infrastructure. You start to look at this interoperability problem from a very different perspective, and suddenly we are no longer talking about compound objects in digital repositories; we're talking about aggregations of web resources. And the solution that is being proposed to tackle the challenge is: just publish a machine-readable document out there that states that these things belong together, that they are related to each other, and so on.

Once you make that fundamental shift of no longer looking at the problem from a repository perspective but from a web perspective, you suddenly have an arsenal of well-defined concepts and tools at your disposition to start tackling the problem. Obviously these come from the architecture of the World Wide Web: the primitives of a resource, a resource identified by a URI, and a resource having a representation. These are tools that we can work with to define and solve our problem. And if we're going to talk to machines and convey information to machines, we're going to use the Resource Description Framework; by the way, I highly recommend the presentation that Rob Sanderson will do about this later this afternoon, I think. So RDF, URIs again of course, and vocabularies to help solve your specific problem: these are the tools of the trade now. What you see at the end of two years of thinking and working hard on this problem of compoundness is that the solution is as simple as what is depicted here at the right-hand side. You see these resources, and to someone they seem to belong together: three resources with URIs. So what you do is you introduce a new resource into the web graph, and that resource has its own identifier and stands for the union of these. Then we say, okay, we're now going to put out a document on the web that describes exactly what you see here: it states that there is this resource that is an aggregation and that it aggregates these things, and this document is published on the web for crawlers and applications to discover. So it comes at it from a completely different perspective than repositories; this is a solution completely embedded in the web architecture.
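As a minimal sketch of what such a machine-readable description might look like, here is an aggregation and its Resource Map expressed with rdflib. The ORE terms (ResourceMap, describes, Aggregation, aggregates) are the real vocabulary; the example URIs are made up.

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

# Hypothetical URIs: the aggregation (the new resource that stands for the
# whole), the resource map that describes it, and the aggregated resources.
aggregation = URIRef("http://example.org/article-1234/aggregation")
resource_map = URIRef("http://example.org/article-1234/resourcemap")
parts = [
    URIRef("http://example.org/article-1234/article.pdf"),
    URIRef("http://example.org/article-1234/dataset.csv"),
    URIRef("http://example.org/article-1234/slides.html"),
]

g = Graph()
g.bind("ore", ORE)

# The resource map is the document published on the web; it describes the
# aggregation, and the aggregation aggregates the constituent resources.
g.add((resource_map, RDF.type, ORE.ResourceMap))
g.add((resource_map, ORE.describes, aggregation))
g.add((aggregation, RDF.type, ORE.Aggregation))
for part in parts:
    g.add((aggregation, ORE.aggregates, part))

print(g.serialize(format="turtle"))
```

A crawler that understands the ORE vocabulary can discover such a resource map and learn which resources belong to the compound object without any repository-specific protocol.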
There is a fundamental shift in the thinking here. For me this was a very big moment in my career, as a matter of fact: to be able to make that shift away from thinking about problems in terms of digital libraries and to start thinking about them in terms of the web architecture and the tools the web gives us. I know that for other people involved in this effort it was a similar eye-opener. You use the notions defined in the web architecture, resource, URI, representation, as the tools of the trade, and suddenly, by doing that, you get integration with other web applications, and you get the potential of adoption beyond the community that you initially devised the solution for. This, as a matter of fact, has happened: ORE was conceived from the perspective of scholarly communication, and we see it now being adopted in cultural heritage kinds of environments like Europeana and the Digital Public Library of America. ORE is one of the highly used vocabularies now in the linked data environment, and that's all the result of making the shift towards web architecture.

What brought it home for me that we were on the right track was an experiment that I did with my team in 2007; it is described in the paper that you see cited at the bottom there. What we did is we ran a simulation, basically, and we pretended that there were two publishers of aggregations of compound objects out there. We said, well, these aggregations consist of web resources, and resources on the web change, and that means that the aggregations change, and as a matter of fact they change in two different ways. On the one hand, existing resources can leave an aggregation and new resources can enter an aggregation; on the other hand, the resources that are part of the aggregation can themselves evolve over time. So we're thinking, well, how are we even going to know, for this aggregation, what it looked like at a certain moment in time? How are we going to be able to look back at that? Well, the only thing we had to do was take existing tools, web archiving tools and web crawling tools, Heritrix and the Wayback Machine basically, to archive these aggregations and the aggregated resources, just using existing
tools and infrastructure, and in doing that we were able to look back at the state of these aggregations as they changed over time. All of that was given to us for free because we had used the tools of the web interoperability trade; we could just use off-the-shelf technologies to deal with this problem.

Going back to the little picture I showed you before, this is how I would characterize this new perspective on interoperability. These two parties up there need to interoperate, and rather than saying, well, we're going to piggyback on the web infrastructure and we're going to interoperate with each other, they now interoperate with the web infrastructure itself, directly. That, of course, suddenly opens up the action radius of your interoperability effort very significantly, because all these parties at the bottom that also leverage the web infrastructure now come into scope, become potential adopters of whatever you have done.

It's not only technology that presents challenges when you deal with these kinds of aggregations, these compound objects. As a matter of fact, the example I gave you of the splash page from the physics arXiv is actually not the very best example, because in that case basically all the resources being aggregated are part of the same repository. In many cases that is not true of these compound objects: a repository mints an aggregation, but the aggregated resources live in a wide variety of environments, and these environments operate under very different regimes, technically, legally, socially, etc. So you start to deal with challenges like: what does the stewardship of such a compound object even mean in an environment where all the components of your object are distributed across these different environments? And what do access and access rights mean? How meaningful is an aggregation when each of its components has different access rights? Does that even make sense, or should all of these resources be under the same kind of license, hopefully a liberal kind of access license? This reminds me of a very similar kind of challenge in the linked data environment, where people are reaching the conclusion that when you're going to merge different linked data sets, it's not enough that they're available under a liberal Creative Commons license; as a matter of fact, they had better all be available under the same license, and hopefully even CC0, because then you can really merge and really leverage the result of merging your content.

Enough about ORE; I'm now going to move on to Memento. Memento is all about the web and time, and there are clear roots of Memento in the experiment I described earlier, where we looked at these evolving aggregations. As a matter of fact, if we had not done that experiment related to ORE, about seeing how these aggregations evolve over time, maybe we would never have ended up with the Memento concept. The consideration here is that when you have a resource on the web, and that resource is of course identified by a URI, at any moment in time it is only possible to obtain a representation of the current state of that resource. You dereference a URI and you get back what that resource is about at this very moment in time. That introduces the question: what if we have representations of prior states of that resource, archived states? How are we going to access those? That's exactly what Memento looks at. It looks at this from the perspective of the
web in general, but I'm going to try and convince you in the next 10 or 15 minutes that Memento also has consequences for scholarly communication, and that is in this realm of scholarly assets that do not have the same sense of fixity that PDFs and so on have; they are continuously on the move, changing under our feet, basically. Even the traditional assets that we know so well, like journal articles, are already becoming much more dynamic than was the case in the past. The simplest example, it surfaces again, is to just look at splash pages and how dynamic and changing over time they have become; they've become very rich environments that give all kinds of information pertaining to the asset described on the splash page, where they used to show only a little bit of metadata about that resource. Take it a little step further and you get into publishing in open peer review and open commentary environments, and now you see that an author submits a first version that is openly accessible, there's some commentary, a next version, commentary again, etc., and even that journal article starts to evolve over time. Same thing with web-native authoring of scholarly papers: this example here is from Authorea, very recently launched and created by Alberto Pepe, who was on my team for a brief while. Built into this environment, where scholars can basically write their papers in a web browser, is this whole notion of versioning of resources, the mere fact that these things will not be stable but will evolve over time and can be shared with colleagues or with the web at large.

These are just examples with traditional kinds of materials. Take it to the other extreme and you arrive in the world that was described fantastically last year in the keynote by Carole Goble from Manchester University. This is the realm of scientific workflows, and what you see here at the left-hand side is a workflow that is being submitted to the myExperiment environment. This workflow actually calls upon services that are listed in the BioCatalogue service registry; those services themselves call databases that sit behind each of them, and the workflow is executed on a workflow engine. This is an example of things that are not only continuously in motion but also highly interdependent on each other. Workflows change over time, service descriptions change over time, as do the databases that sit behind them; the workflow engine software changes over time, the operating system that sits underneath changes over time. So this is all continuously in flux, and that leads to the problem that when you run the workflow today and you run it three weeks from now, you'll get different results. That is the problem now commonly known as the reproducibility of in silico science. I'm not going to go there, because that's not my expertise; the point here is that when you have results coming out of running such environments, you will need to know what the state of these interdependent resources was in order to be able to accurately interpret those results. You will need to know: what was the state of this complex system when I ran it at that moment in time, what was the state then, etc.

To summarize: the whole notion of fixity of scholarly communication assets is seriously challenged. We are dealing with assets that are continuously evolving, and that makes us have to embrace, basically, the notion of the state of the scholarly record at a specific moment in time, and hence the
title of my presentation, which is about the move from what we know as the version of record to a version of the scholarly record. Essentially, we will have the need to look at interdependent, related assets that evolve over time and to be able to determine what the state of these assets was at a specific moment in time, and this is where Memento comes in. I'm going to give you a little description of Memento so that you will understand later on how it can help in tackling this very significant challenge.

I'll give you two perspectives on Memento. The first thing you see here is the CNN homepage from the day of the September 11 attacks, and this comes out of a web archive. The observation here is that this resource, which is an archived resource, has this URI here; we call that URI-M, for memento, and we call this thing a memento: a specific URI for this version. Then we have another resource, URI-R, which is cnn.com, where you would find the current version of the CNN homepage. So the archived version at a specific moment in time has a specific URI, and there is a generic URI from which, at any moment in time, the current representation is available. You see a similar situation in content management systems like wikis. Here you see the very first page in Wikipedia about the September 11 attacks, and you see the same pattern: we have a URI-M that is an identifier for this specific version, and then you have a generic URI from which, at any moment in time, you can obtain the current version of that resource.

I will depict this as follows, and there are going to be a lot of these kinds of pictures from now on, so you had better get used to them. We have time as it evolves here, and what we see are different mementos: basically, there is a resource for every version. Tim Berners-Lee calls these time-specific resources, and it says here that the M0 memento is alive from time zero to time one, then from time one to time two we have this one, from time two to time three we have this one. So these are version-specific URIs. And then you have the notion of a time-generic resource, and that's basically the one at the bottom, the one from which at any moment in time you can obtain the current version. If we look at it today, let's say we obtain this version; if we had looked at it at T1 we would have gotten this one, which would have been the same content as the memento that was alive at that time. This was not invented by Memento; this is a pattern that you see all over the web, and Memento just leverages it. Here is an example taken from the Architecture of the World Wide Web document, and what you see here is: the latest version will always be available at this URI; this version, the current version, also has a version-specific URI; and the previous version is at this URI. So that's the pattern I was just describing.

For Memento, the question now becomes: given this kind of pattern of version-specific URIs and the generic URI, how do I get from the generic URI to the version that was available at a certain moment in time? Why would it be of interest to be able to do that? Well, because this URI at the bottom here, this generic URI, is the one I'm going to send you in an email, and the one I'm going to bookmark, and, if we talk about scholarly communication, the one I'm going to put in a reference, as a matter of fact probably with the date stamp next
to it, like "as accessed on this date." That leads us to wanting to use this generic URI to reference the resource, alongside a time indication, and then to try to go from there to an appropriate version of the resource. This is exactly what Memento does, and it achieves this, again, by leveraging the primitives of the web architecture: resource, URI, state, representation, link, and, as a matter of fact, also content negotiation.

Here's how it works, and I'm not going to go into a lot of technical detail. The consideration is that this resource and these version resources may reside on different systems, as is the case with web archives: cnn.com lives here, and the web archives that have old versions of cnn.com sit up there. So this resource down here doesn't really know about the archived versions that live in another system. But what we say is: well, there is a resource, and that system clearly knows about all of these mementos, so we're introducing a resource that actually is aware of the past of that one, and we allow navigation from the generic resource to this new resource, which we call a TimeGate. That TimeGate knows about the past of the resource, so a web client can now follow that link to the TimeGate, and there it uses content negotiation in the datetime dimension to obtain an appropriate version, meaning the version that was active at a certain moment in time. For those who are not aware of this: content negotiation is built into the HTTP protocol; it is something that your browser does all the time. It will express a preference, in our case of English over French, or of HTML over PDF; it just provides this information as it talks to a server, and if the server can honor these preferences it will actually do so. HTTP defines content negotiation for a couple of dimensions, including language and media type, and we introduce it for datetime. That's exactly what's going on here. To cut a long story short, what Memento allows you to do is look at a website as it is today, and that's the cni.org page of today; you use a browser that is equipped with a Memento-compatible tool, you select a datetime in the past, and automatically the protocol will lead you to an appropriate version of that page from around the date you selected. Not necessarily exactly the date you selected, because that may just not be available in any web archive; what you get back is obviously subject to what is available out there in the archives. But that's basically what Memento allows you to do.
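As a minimal sketch of what that datetime negotiation looks like at the HTTP level: the Accept-Datetime request header, the Memento-Datetime response header, and the link relations are the actual Memento conventions, but the TimeGate endpoint used here (the public Time Travel aggregator) and the target URI are just illustrative assumptions.

```python
import requests

# Hypothetical target: the generic URI (URI-R) we want to see as of a past date.
uri_r = "http://www.cni.org/"
# A TimeGate that knows about archived versions of that resource; here we
# assume the public Memento aggregator, but any Memento-compliant TimeGate works.
timegate = "http://timetravel.mementoweb.org/timegate/" + uri_r

# Datetime negotiation: express the desired moment in time in Accept-Datetime.
resp = requests.get(
    timegate,
    headers={"Accept-Datetime": "Thu, 15 Apr 2010 00:00:00 GMT"},
    allow_redirects=True,
)

# The response is the selected memento (URI-M); Memento-Datetime says when
# that snapshot was taken, and the Link header points back to the original
# resource, the TimeGate, and the TimeMap listing all known mementos.
print("Memento URI:     ", resp.url)
print("Memento-Datetime:", resp.headers.get("Memento-Datetime"))
print("Link:            ", resp.headers.get("Link"))
```

Whether the snapshot comes close to the requested date is, as noted above, entirely a function of what the archives happen to hold.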
So now, back to this whole notion that fixity is challenged: can we reconstruct the state of these interdependent and interrelated assets as they were at a certain moment in time? Here we go again with one of these pictures. There is a timeline here, and what we have now are two resources that are interdependent and related, and they are progressing over time. Let's talk first about the simple case, a discrete progression. This basically means these are more or less traditional scholarly communication assets, and there are humans involved in deciding: yes, there was enough change, let's now mint a new version; there's a board of editors that says, look at this, there's a substantial change here, let's create another version. So we have this human kind of decision-making in minting new versions of the resource. Well, if these versions reside on systems that are Memento-compatible, then I can exactly recreate the state of these interdependent resources as they were at time Ti, because I use the Memento protocol to dereference each of these URIs subject to time, and I arrive, in this case, at this URI and, in that case, at this URI, and I basically know that at Ti this one and this one belonged together. I have connected them back in time.

But what about when there is no such thing as editorial control, no people who sit there and decide there is a new version available, and there is a more continuous progression of these resources? How, in that case, are we going to reconstruct the state of that system? Let's look at a specific case of that: the case of a paper that references other resources, and let's see whether we can reconstruct the cited context of this paper as it was at the moment those resources were actually cited. So we have a special case of the general problem I was describing: the time Ti is the time of publication, and the relationship between these resources is that they were all cited by the same thing. The paper I'm going to use is the one I talked about earlier. Here we go: this is the reference list, and this paper was published on September 15, 2004, so what we would like to see, for the URIs that are referenced here, is an archived copy of these things from around the time of publication.

Let's look at the first one (there's something going wrong here, I don't know what): it's the cyberinfrastructure report by Dan Atkins. We dereference that URI, and not only is the resource no longer available, the domain doesn't even exist anymore; this was a fundamental report on cyberinfrastructure, by the way. We now use Memento, and we arrive at a copy of that resource in a web archive from December 5, 2003, so not very close to the publication date of the citing paper, but something nevertheless. The second one up is a paper by the people from Southampton, and the current version, as a matter of fact, still exists in their institutional repository; we use Memento and we find an archived version very close, actually, to the publication date of the citing paper. The third one up is a paper by Cliff Lynch about institutional repositories, one that was put out in the ARL series. We dereference that URI and it's gone (this is ARL for you), but fortunately with Memento we find an archived copy from the end of 2003. And the last one I'm looking at is a paper by Sandy Payette and Thorny Staples about the Mellon Fedora project; the resource is gone when we dereference it, there's nothing there anymore, and unfortunately there's nothing in a web archive either, so Memento doesn't find it back. All of this is, of course, anecdotal evidence.
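A sketch of how such a cited-context check can be automated: for each referenced URI, ask whether it still resolves on the live web, then negotiate for the memento closest to the citing paper's publication date and see how far off it is. The Accept-Datetime and Memento-Datetime headers are the real protocol elements; the aggregator TimeGate and the example reference URIs are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

import requests

PUBLISHED = datetime(2004, 9, 15, tzinfo=timezone.utc)   # citing paper's date
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"  # assumed aggregator

references = [  # hypothetical referenced URIs taken from the reference list
    "http://www.example.org/cyberinfrastructure-report",
    "http://www.example.org/ir-paper",
]

for uri in references:
    # 1. Does the referenced resource still resolve on the live web?
    try:
        live = requests.head(uri, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        live = None  # domain gone, connection refused, etc.

    # 2. Is there an archived copy from around the time of publication?
    resp = requests.get(
        TIMEGATE + uri,
        headers={"Accept-Datetime": format_datetime(PUBLISHED, usegmt=True)},
        allow_redirects=True, timeout=30,
    )
    mdt = resp.headers.get("Memento-Datetime")
    drift = abs(parsedate_to_datetime(mdt) - PUBLISHED) if mdt else None

    print(uri, "| live status:", live,
          "| closest memento:", mdt,
          "| drift:", drift)
```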
Rob Sanderson led an experiment, a pilot study of what this looks like in real life, by looking at URIs listed in reference lists, and also in the body text, of papers in the arXiv and UNT repositories. He basically looked at: does the referenced resource still exist, are there archived versions of the referenced resource, and if so, are there archived versions from around the time of publication of the citing paper? The findings are very similar to what I showed you in the anecdotal evidence, and so there's good news and there's bad news. The good news is that, despite there not being any kind of proactive effort to archive these referenced resources, some of them, actually quite a considerable number of them, were found in web archives. Why is that? Because there is continuous web archiving going on; web archives are sending their crawlers out, and because resources have URIs they get archived for free. There is this accidental archiving that comes for free in the web infrastructure. The other observation I'd like to make here: we saw that Memento enables us to find resources back even if they currently no longer exist and hence respond with a 404. This makes me ask the question: if we consider web archives to be part of the web infrastructure, and we add Memento as a protocol to the mix, does that make HTTP URIs potential candidates for persistent identifiers? The bad news, coming out of both the anecdotal evidence and the experiment that Rob did, is that many resources were not archived, and for very few resources were there archived copies from around the time of publication.

To summarize, going back to our picture: yes, to a certain extent we were able to recreate the state of these interdependent resources at a certain moment in time, but to a large extent we were not, and we were able to do so only thanks to accidental archiving. That tells me, basically, that we should probably start to become more proactive in archiving these continuously evolving resources that are now part of the scholarly record. How to do that, who is going to do it, and when we are going to do it would actually be the subject of a very interesting conversation, and there are lots of options. The resources can be self-archiving: for example, if you use a content management system, a wiki, a data wiki, with solid versioning mechanisms, then as a matter of fact these things get actively archived as they change over time. Transactional web archives are another solution; transactional web archives basically save a copy of every representation that is delivered to a client. You can rely on resource archiving via web archives, as was shown in the example, accidental archiving: you basically do nothing, you sit back and relax, and some of your resources will actually get archived. Or you can have an on-demand archiving kind of approach, where you subscribe to a web archiving service that comes in and regularly crawls your environment. You can think of archiving resources at significant moments in their life cycle, for example when I submit a paper or when I publish a paper, we're going to collect all the referenced resources and archive those. Or you can think about archiving resources at the moment they are being interacted with after they are put out there, for example when there is social network interaction with them, when they are being downloaded, annotated, and so on. A lot of options, as I said, and a lot of food for thought and debate.
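As a sketch of what the "archive the referenced resources at a significant life-cycle moment" option could look like in practice: at submission time, push each referenced URI to an archive that offers an on-demand capture endpoint. The Internet Archive's Save Page Now endpoint is used here only as an example of such a service; the reference list is of course made up.

```python
import requests

SAVE_ENDPOINT = "https://web.archive.org/save/"  # on-demand capture service

def archive_references(referenced_uris):
    """Request an archival capture of every resource a paper references,
    e.g. at the moment the paper is submitted or published."""
    snapshots = {}
    for uri in referenced_uris:
        try:
            resp = requests.get(SAVE_ENDPOINT + uri, timeout=60)
            # The capture, if successful, is itself a memento with its own URI.
            snapshots[uri] = resp.url if resp.ok else None
        except requests.RequestException:
            snapshots[uri] = None
    return snapshots

# Hypothetical reference list collected from a just-submitted manuscript.
print(archive_references([
    "http://www.example.org/cyberinfrastructure-report",
    "http://www.example.org/workflow-description",
]))
```

The resulting snapshot URIs could then be recorded alongside the references, so the cited context can later be reconstructed even if accidental archiving never picked those resources up.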
Cliff, how much more time do I have? Okay. So, I've talked an awful lot about technology and machines, and I want to end on a more human note. I want to observe that not only has the scholarly communication system changed by means of the kinds of materials that are added to it and their characteristics; as a matter of fact, the way the contributors to the scholarly communication system exist, their very nature, has changed also. As I see it, and I'll show you a couple of slides, they used to be in the periphery of the scholarly record, and they're now taking central web stage. This is enabled by the online identities these contributors are getting in various portals and social networks. They make contributions by depositing assets in these kinds of social networks, and I don't necessarily mean Twitter or Google Plus; when I talk about social networks I'm also talking about things like SlideShare and Figshare and GitHub and what have you not. So we have these portals that link the assets being contributed with the contributor's identity, and in doing so one can now start to derive metrics pertaining to these contributed assets and hence also pertaining to the contributor.

If I were to characterize the previous environment, it would be like this: we really had the journals at center stage, and they had their ISSNs and what have you not, and articles get published in journals, and, oh yeah, there are actually authors for these, look at that, and they were identified by their very ambiguous names, right? And they derived all their credit indirectly, as a matter of fact very indirectly, from things like impact factors, computed at the level of journals, not at the level of the things they really contributed. Fast-forward, and suddenly there's a big shift: suddenly we see our contributors at the center, identified in different environments with HTTP URIs, with ORCID identifiers, what have you not. They interact in social environments, they create all kinds of assets that live in a wide variety of portals, and, you know what, they can actually get credit for each of these contributions, because all of these environments are counting and creating metrics about what these people are contributing. So it's a fundamental shift in how contributors reside in this new web-based scholarly communication system. This was actually observed as one of the big trends, one of the seven predictions for the future of research, in this document by Jisc: research will fully embrace social media, and again, by social media we are not only talking about Twitter but also all these environments in which people are submitting scholarly assets.

This leads me to the notion of "surface your scholars," and this of course is a bit of a paraphrase of what Lorcan Dempsey is always talking about: when he talks about the inside-out library, he talks about surfacing library materials to the web. What I'm talking about here is basically that institutions should really surface their scholars in these environments; they should actively promote that their scholars take an active part in these professional, academic, asset-oriented portals such as LinkedIn, Mendeley, SlideShare, myExperiment, GitHub, etc. The reason is that this will increase not only the visibility of these scholars but also the visibility of their institutions; it will not only give us more metrics about the scholars, it will give us more metrics about their institutions, because eventually all these metrics will be aggregated to the level of their departments and their institutions, just as they were with the impact factor. The same thing will happen with these alternative metrics that we see popping up all over the place now. This reminds me of a paper I came across recently by colleagues from the University of Sydney. What they did is they basically created a ranking of universities worldwide purely by looking at DBpedia: the link structure around a university and the people that have studied at the university, the kinds of degrees they got, the kinds of awards they got, etc. Purely by looking at the two-degree link structure of these Wikipedia/DBpedia pages and the types of link relationships, they obtain
a ranking of institutions that is, as a matter of fact, highly correlated with the existing ones coming out of the Times Higher Education and the one whose name I'm going to blank on that comes out of China: purely by looking at open data, purely by looking at web-based information and a little bit of smart computation.

Okay, I'm going to wrap up. Basically, what I did is I looked at a 15-year evolution, and in those 15 years we have gone from a very fuzzy understanding of the web infrastructure to a true understanding of what the web really means. We have gone from not really knowing how to deal with interoperability on the web to fundamentally leveraging the web infrastructure in our interoperability efforts. In that same time period, we have moved from a scholarly communication system that was based on a stack of journals and a bunch of PDF files to a network of interconnected assets and actors. What I've tried to show you by means of ORE and Memento is that it is actually possible to tackle fundamental challenges related to the new scholarly communication environment by leveraging the web infrastructure and its primitives. Other efforts that I'm involved in, like ResourceSync and Open Annotation, are going along exactly the same path. If we can do it for these efforts and for the challenges they tackle, I dare say that we can probably do it for a lot of other challenges that we face in scholarly communication, such as certification, archiving, persistence, trust, metrics, and what have you not, and try to leverage the web infrastructure to tackle those challenges also. The wins are obvious: long-term sustainability, because you're going to use technologies and tools that are widely used by others around you, and, at a more philosophical level, we'll achieve a tight integration of the scholarly discourse with the other web-based discourse that happens out there. Now, with your permission, I would like to go back to this, but maybe it's time for a few questions, Cliff.

Thank you. We have time, it seems, for one or two questions.

Thank you very much, particularly for talking about how our way of looking at this world has changed, and our new kinds of concepts, and also for your work on Memento, which I really feel is groundbreaking, very interesting. Can you speculate on how one might extend some of the things you're talking about into other areas that are, at least in my view, quite problematic? For example, you're talking about material once it is out on the open web. One big problem we have now is that, for the versions of someone's book or manuscript or whatever, we only get the last version; we don't get the previous versions. Those are not on the open web, but they may actually be on the web, shared by people: I write something, I put it up, I share it with others. Is there some way we could leverage, with file naming, with other things, to extend this into that area?

Okay, I think I understand the question, which in essence is about resources that get minted and evolve over time, where not necessarily is that entire evolution publicly visible; some of it may be and some of it may not be. If I answer this question from the Memento perspective, I would say that the Memento concepts still hold irrespective of whether some of these things are visible and some are not. The problem becomes, of course, one (and I see David right behind you) of access rights and how to deal with that, and as you know, in a lot of efforts we have this notion of separation of concerns:
you first tackle one problem, and then you say, well, authorization and authentication, that's for someone else. But there's no reason not to look at resources that reside in a closed environment in the same way as you look at them in an open environment: the same versioning mechanisms, the same URI schemes, the same notions of Memento datetime negotiation could apply. The question becomes one of access rights, and that should be, or at least in my perspective is, orthogonal to the versioning and to the Memento protocol as such. So if you have access rights to that resource that sits behind the firewall, then you will actually get there; if not, you'll get a nasty kind of error message, probably, just as it is out there currently.

Yes, okay. So, that's why, but probably I went a bit fast: there is this whole notion of trying to do that not in a human-mediated way but in a machine-mediated way, and hence my suggestion to start thinking about using content management systems with strong versioning mechanisms, wikis, data wikis, because those will automatically take care of all of that. And if you then want to make those Memento-compliant, it's about, basically, developing a plugin for that system, and then you have the whole Memento approach at your fingertips also. So there are systems that can help you with doing that.
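To give an idea of how thin such a plugin can be, here is a sketch of a TimeGate handler for a versioned CMS, written with Flask. The Accept-Datetime/Vary/Link handling follows the Memento pattern described earlier; the toy version store and the wiki URI layout are placeholders for whatever the CMS already provides.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

from flask import Flask, redirect, request

app = Flask(__name__)

# Toy stand-in for the CMS's version store: page -> [(creation datetime, version id)].
VERSIONS = {
    "SomePage": [
        (datetime(2011, 3, 1, tzinfo=timezone.utc), "v1"),
        (datetime(2012, 7, 9, tzinfo=timezone.utc), "v2"),
    ],
}

def closest_version(page, when):
    """Return the (datetime, version id) of the revision current at `when`."""
    history = VERSIONS[page]
    older = [entry for entry in history if when is None or entry[0] <= when]
    return older[-1] if older else history[0]

@app.route("/timegate/<page>")
def timegate(page):
    # Datetime negotiation: honor the Accept-Datetime request header if present.
    accept_dt = request.headers.get("Accept-Datetime")
    when = parsedate_to_datetime(accept_dt) if accept_dt else None

    _dt, version_id = closest_version(page, when)
    uri_r = f"https://wiki.example.org/{page}"                       # generic URI (URI-R)
    uri_m = f"https://wiki.example.org/{page}?version={version_id}"  # version URI (URI-M)

    # Redirect the client to the selected version and advertise the pattern.
    resp = redirect(uri_m, code=302)
    resp.headers["Vary"] = "accept-datetime"
    resp.headers["Link"] = f'<{uri_r}>; rel="original"'
    return resp
```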
Yeah, David, are you going to hurt me again?

No, no, I hope not, and I actually disagree a little bit, although I agree with most of it. What I wanted to do: a lot of the enthusiasm about the revolution in scientific communication is about metrics, and you brought that up, and that's very important. One of the lessons, and I've worked at three successful startups, one of the lessons is that management likes to measure stuff, because then they can improve it, and what people end up doing is gaming the system. I want to draw attention to a wonderful paper by Brembs and Munafò called "Deep Impact: Unintended Consequences of Journal Rank." What they basically show is that gaming the system is endemic; high-ranking journals are quite strong predictors of malpractice in various ways, and the reason is that people are gaming the system. The caution I want to put out there is that their argument actually applies to any form of ranking at all, including all these metrics that people are getting so enthusiastic about. What we're going to do is transform one kind of gaming the system into another kind of gaming the system, and we have the awful warning out there about gaming the system, which is called search engine optimization. It's going to happen in scientific communication, it's going to happen in archiving, all these things: once you get something out there that has these very strong metrics, people are going to game the system. I just want to dampen down the enthusiasm a little bit, because we need to provide mechanisms for fighting back against gaming the system, and these need not to be layered on top of the protocols, because that's generally not very effective; they need to be thought through and implemented in the underlying protocols themselves. That's the big mistake that was made with Google: the fighting back against search engine optimization is layered on top.

Yeah, just a quick reaction. I very much agree, of course, with basically everything you say, David. I do, however, find it good news that we come from an environment in which we had one metric at our disposition, the impact factor, which could be gamed also, and actually has been abused to a very large extent in many different kinds of ways, and we're now moving to an environment where we have a whole range of metrics available to us. In both cases you need to use them with a grain of salt and be very aware of the dangers involved in interpreting these things. Another thing I'd like to add, and this is about the whole realm of altmetrics, is that from my perspective it is a bit too early to start to police them. I would love to see a lot more chaos initially, in exploring the consequences and the possibilities in that realm, in order to maybe, a couple of years from now, come to some kind of consistency, new metrics that we could all agree on. And there will be a whole lot of debate in that realm, I'm sure.

All right, well, thank you again for your attention. Thanks.

With that, I wish you a good series of breakout sessions, and I'll see you all at the reception. Herbert, thank you again; we will send you back to your desert and your vehicle, but we really appreciate you sharing your thinking with us. Thank you again.