So, first of all, I'd like to give a brief overview of the framework and how it works to allow access to the past web. Then we'll look at how deployment is coming along, how Memento is being used for data, for discovery of mementos, and for branding of archives, and finally at some alternative web archiving strategies that we've been researching. The middle two, to throw back to Christine's brilliant talk earlier, really fall into the same areas that Christine was talking about, in terms of reproducibility, access to dynamic data, and so forth.

OK, so a brief overview of the framework. The main goal of Memento is to make it easy to access the web of the past. Say you're on the Tate Online site in the UK. You might then want to select a date that you want to see it at, rather than the current version, and magically you arrive back in 2008, in the state that the website was in at that point. In this case we've got it from the National Archives. So here we have the reverse trend from what Christine was talking about: instead of going from static to dynamic, we want to take that dynamic content and go back to previous static snapshots, and we want to do that automatically.

We achieve this by introducing a uniform version access capability to the web that allows you to specify, in one common way, the time at which you want to see a website. The web has various content management systems. These are designed to be aware of all of the versions of a resource in a kind of self-contained way, and they have a variety of proprietary version linking and interlinking mechanisms. This has the great advantage that the dynamism is managed. The web in general, however, is completely the opposite: it's designed to forget about prior versions of a resource, it's very distributed, and hence management of the whole thing is an impossibility. So we have a pretty hard task in marrying those two.

Of course, various resource versions do exist online: in content management systems such as MediaWiki or WordPress, in web archives such as Brewster's disruptive Internet Archive, the UK Web Archive, Archive-It and so forth, in transactional archives that we'll talk about a bit later under alternative web archiving strategies, and of course in search engine caches. But at an architectural level, the web has a very hard time dealing with these things. You can't talk about a resource as it used to be; you can only talk about the resource as it is now. The web is always in the present, never talking about the past. If you know the current URI for a resource, you can't automatically get from there back to a prior state of the resource, and given a prior state of the resource, you can't get from there back to the current resource that it is an archived copy of. There are of course various approaches to this, but they're all very localised and ad hoc. The Internet Archive's approach isn't a standard, it doesn't obey the rules of the web architecture directly, and it's one centralised archive. This is what we want to try to change.

So we regard the web as one single, large content management system. What we need to do is introduce a single, uniform way of accessing versions of resources across the web. To do that, we're not going to reinvent the wheel and try to compete with the Internet Archive, of course, but instead leverage all of the systems in which archived or past-state snapshots of resources are available.
So that includes web archives, of course, but also content management systems such as MediaWiki, software versioning systems and so forth. Memento's approach in this regard ticks almost all of the boxes that Christine brought up. It's distributed, so versions can exist on several servers around the place. We use time as our global version indicator: to say that something is version 1.0 or 1.2 really doesn't mean anything in a global sense, so the only indicator we can make global use of is time. And it's based on the primitives of the World Wide Web: the state of the resource, which is a representation; content negotiation; and linking between resources.

So this is the current state of the web, essentially. We have this grey ball on the left, which is CNN, and that's the current state of the CNN homepage. In the Internet Archive, of course, there's a whole bunch of different versions of the CNN homepage, such as from April 2001, August 2007 and so forth. And there's no connection between those resources. You can't get from CNN to the Internet Archive, and unless you deconstruct the URI, you can't get from the Internet Archive to CNN.

So what we want to do is introduce something in the middle that allows us to make those negotiations. We do that by introducing a resource that we call a time gate. As you can see, we've had a lot of fun coming up with names for things, and when we at Los Alamos announced that we'd produced a time gate for the World Wide Web, we got rather a lot of unintended publicity as well as intended publicity. This time gate performs something called content negotiation, which happens all the time with browsers. If you go to one site on your iPhone versus on your PC and you get a different site, that's content negotiation. We make use of that capability to content negotiate in time instead of format. So you go to CNN, there's a link from CNN to the time gate, the time gate then performs the content negotiation based on time, and you end up in the Internet Archive. This is the bridge from the present to the past.

We also want a bridge from the past to the present, without knowing exactly when the lightning bolt's going to hit the steeple so you can have the DeLorean going at 88 miles an hour with the hook. So, without going into too much detail, there's a link back from the archived resources through to the current representation at CNN.com. And that's it. That's the entire framework that allows us to get from current resources through to their archived states.

One thing that's really important to notice is that this solution works in a very distributed fashion. You can have multiple archives all integrated within the same system, such as the Internet Archive, Archive-It, all of the national libraries' archives and so forth; WebCitation would be another example. So in this case, when we're building up this CNN homepage, we might use that HTTP link to the time gate, and then from there content negotiate to find the best copy of the resource as it appears across all of the different archives, rather than just going to the Internet Archive and hoping it has the most appropriate copy.

So if you didn't follow any of that, that's fine. Here is the one-slide memento interaction summary that you can hopefully take home. First of all, the browser goes to the resource that it's interested in, which we call the original resource, and it says, do you have a preferred time gate?
Do you know where to go to find old versions of yourself? It might say yes, please go and see this resource that we call G. But in the current state of the web, it's very likely to say, no idea what you're talking about, you should use a default. After selecting or being offered a time gate, the browser then goes to that time gate and says, where's the archived copy of the resource for the time that I want? The time gate then says, I know about that resource, it's at M, or, I really don't know what you're talking about, you should try another time gate. At that point you would cycle through the time gates that you know about, looking for one that recognises the resource. Then finally you go to the archived copy and say, hi, I want a copy of you, and it delivers it back. So that is memento in a nutshell.

What I want to talk about for the rest of the presentation is where we've gone from this: how we're using it and how it's being deployed. We've been really lucky with deployment, and we've made really significant progress towards getting this implemented and used. First of all, we're going through the standardisation process with the IETF. On the left, you can see a screenshot of the current internet draft, and we're going to release a new version before too long to try to get some more feedback before we go to the final call for that. We've had a lot of interest from both the IETF and the W3C, including people like Tim Berners-Lee, Mark Nottingham and Michael Hausenblas.

On the client side, we have various memento-capable clients that have been developed both by ourselves and by others. We have an add-on for Firefox called MementoFox, which is fully operational and has been endorsed by Mozilla, and an experimental version for Internet Explorer. There has also been mobile device support developed by others, including an operational memento client for Android and an iPhone and iPad version which is currently being developed. We discuss our work on this in the next issue of the Code4Lib Journal, which hopefully should be out this month, thanks to Chris Carpenter, who I see here.

We have been very fortunate to get the support of the Internet Archive, of course, and Brewster Kahle. The Internet Archive is now memento-compliant, which means that anything in the Internet Archive can be accessed through one of our browser add-ons. That support has also been released to the web archiving community in version 1.6 of the Wayback software, so as web archives upgrade, they will become memento-compliant. We hope that we'll be able to apply a little bit of pressure at the IIPC meeting next month to try to get people to move to that version. So if there's a web archive that you're responsible for or related to, please have them upgrade to 1.6. It doesn't just do memento; it also has a whole bunch of other impressive features.

On the server side, we also have an operational plugin for MediaWiki, which is the platform that Wikipedia uses, as one of many examples. It has been installed on the W3C wiki, which makes it memento-compatible, so you can go back and see all of the old versions of those pages using the memento clients. Again, if you have a MediaWiki somewhere, please install the plugin. You can get it from the tools URI right there, or directly from the MediaWiki website.

We also have a validator. One of the things a lot of people have been asking us is: hey, we really like memento, we think we've done it right, but we're not sure; can you have a look at it?
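As a quick illustration of the interactions the internet draft specifies, and that a validator like ours exercises, here is a minimal client-side sketch in Python. It follows the flow described above, but the time gate endpoint is purely hypothetical, and a real client would also handle the case where the time gate doesn't recognise the resource and another one has to be tried.

```python
# Minimal sketch of a memento client interaction: send Accept-Datetime to a
# time gate, follow the redirect to the memento, and read its headers.
# The time gate URI below is illustrative only, not a real endpoint.
import requests

ORIGINAL = "http://www.cnn.com/"
TIMEGATE = "http://example.org/timegate/" + ORIGINAL  # hypothetical aggregator-style time gate

def get_memento(timegate_uri, when):
    """Negotiate in time: ask the time gate for the copy closest to `when`."""
    headers = {"Accept-Datetime": when}  # e.g. "Tue, 20 Mar 2001 20:35:00 GMT"
    # The time gate answers with a redirect to the best memento it knows about.
    resp = requests.get(timegate_uri, headers=headers, allow_redirects=True)
    return resp.url, resp.headers.get("Memento-Datetime"), resp.headers.get("Link")

if __name__ == "__main__":
    uri, taken_at, links = get_memento(TIMEGATE, "Tue, 20 Mar 2001 20:35:00 GMT")
    print("Memento found at:", uri)
    print("Snapshot taken at:", taken_at)
    print("Typed links back to the original and the time map:", links)
```

The point to take away is that there is no memento-specific machinery here beyond a couple of headers and some typed links; it is all plain HTTP.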
So we put up an automated tool that goes off and acts like a memento client and tries to do all of the interactions that a regular client would do. You give it a URI for a time gate, for an original resource, or for a memento, and it goes through and reports success or failure for each of the different interactions that a regular memento client would perform. This validator is kept up to date with the IETF internet draft: when we make a change to the draft, we update the validator to make sure that everyone's on the same page.

We realise that it's going to be quite a while before everyone becomes memento-compliant, so we've also made quite a lot of progress on proxy support. These are systems hosted by Los Alamos and Old Dominion which make internet sites that have snapshots of versions memento-compliant by proxy. The web archives are an obvious case, but also, wonderfully, all of the MediaWiki systems that we know about are compliant by proxy, including things like Wikipedia and Wikia. Of course, we'd prefer that everything supported memento natively; this is a stop-gap measure to bootstrap availability of these sites to memento clients.

Documentation work is ongoing. We're trying to include more introductory information for understanding the memento system: how to recognise a memento, a time gate or an original resource just from the HTTP headers, guidelines for servers that host mementos, and so on and so forth. We're constantly trying to update that information, so if there's anything that would be helpful in the guide, please let us know and we'll do our best to include it.

As far as funding goes, we started off with a very small grant between 2007 and 2010, of which only about $50,000 went towards memento work; that was more exploratory and scope-defining. We were then very fortunate to get a much larger grant of $1 million from the Library of Congress for 2010 and all of this year, to do things like specification (hence the internet draft), outreach such as today, tool development for both clients and servers, and the further research we'll get to shortly.

OK, so memento and data. We feel that memento's time-travel possibilities are really powerful not just for the human web, but also for the semantic web, in which machines crawl around looking for data that they can interact with, using a process called, in HTTP speak, following your nose. I'll give you a quick walk-through of how it works for the human side, and then we'll relate that over to the data side and how a machine might use it. So that's the memento framework: you've got the current representation of CNN, obviously intended for humans; you can go to a time gate; the time gate then sends you to an archived copy which is closest in time to where you want to be. We have a pic of the day at Los Alamos, which is Herbert standing next to a folder on which we superimpose the current copy of the BBC homepage. That's our original resource, which has the URI on the slide. You can get from there to a time gate, and the time gate will take you to the archived versions. From this we can do some cute things just by following those redirections, following those interactions, and build up a little movie in which time goes forwards, Herbert's t-shirt changes, the date moves on and so forth.
This is cute, but not very useful, because it's just a movie. OK, enough of Herbert.

Now, a quickly inserted slide; thank you very much to Christine for bringing this up. As data becomes more dynamic and processes become more dynamic, reproducibility tends towards zero. However, if we had static and discoverable snapshots of both the data and the process, then we would at least have points in time at which reproducibility was possible. And that's what we want to enable with memento.

So, for example, there is a linked data version of Wikipedia called DBpedia. They take the information from Wikipedia, scrape out everything that they can, put it into RDF and publish it, and they do this for every single page that has an infobox in Wikipedia. They do this every six months, and the previous versions disappear from the live site. Thankfully, they don't actually throw them away; they just store them in big dump files, so we can apply memento to them. To walk through the same process: we start with the original resource, which is an RDF description of France. We follow the time gate link to the time gate. We can then negotiate in datetime to say, I want the version of this data as it was at a particular point in time. We follow that link and we get the data that we're interested in. We can do that for all of the versions in order to build up time series data for processing.

To practise what we preach, we did exactly that. This is a graph of GDP per capita which we built up over time just using HTTP interactions with memento. Here's the first version of DBpedia, in September 2007, and then every six months they publish a new version. Of course, by that time the data in Wikipedia has changed, and hence we get a new data point. To take the United States as an example, it stays pretty constant through the versions in 2008, then in November 2009 it jumps up to here and remains static again. Other graphs are more up and down. You can see the trends of GDP per capita, built up just by using memento against the original resource to find the archived versions. This isn't a particularly useful piece of information to economists; I'm sure they have much, much better sources. But you can imagine taking this out to the nth degree: plot across all of the information in Wikipedia, and hence DBpedia, and it can become really powerful.

OK, memento and discovery. Another point which Christine made was that we need to be able to discover things. We need to have the links between resources semantically typed in order to understand how we came from where we started. Very few sites currently provide this time gate link, and hence we're going to need other methods in order to find mementos. One way that we're looking at doing this is a batch discovery process which we call a time map. This is an ORE aggregation (Object Reuse and Exchange, from OAI), and it includes at least all of the URIs and datetimes of the archived copies of the resource that an individual archive knows about. These time maps can then be aggregated across systems in order to provide access, from one point, to resources that are spread all around the web. These aggregators could be in lots of different places: there could be one at OCLC, one at Los Alamos and so forth. The interesting point is then being able to take that information from an aggregator and aggregate it again into other resources.
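To make the DBpedia walk-through a bit more concrete, here is a rough sketch of how a series like the GDP-per-capita chart could be assembled purely from HTTP interactions. It is a sketch under assumptions: the time gate URI is hypothetical, the property name is illustrative rather than the real DBpedia predicate, and it assumes each dated request redirects to an archived, JSON-serialised copy of the data.

```python
# Hedged sketch: build a small time series for one DBpedia resource by
# datetime-negotiating against a (hypothetical) memento time gate.
import requests

ORIGINAL = "http://dbpedia.org/data/United_States.json"
TIMEGATE = "http://example.org/timegate/" + ORIGINAL  # illustrative endpoint only

# Dates roughly matching DBpedia's six-monthly releases.
DATES = [
    "Sat, 01 Sep 2007 00:00:00 GMT",
    "Sat, 01 Mar 2008 00:00:00 GMT",
    "Mon, 01 Sep 2008 00:00:00 GMT",
    "Sun, 01 Nov 2009 00:00:00 GMT",
]

series = []
for when in DATES:
    # Negotiate in time: the time gate redirects to the snapshot nearest `when`.
    resp = requests.get(TIMEGATE, headers={"Accept-Datetime": when})
    snapshot = resp.json()  # assumes the archived copy is JSON-serialised RDF
    gdp = snapshot.get("gdpPerCapita")  # illustrative key, not the real predicate
    series.append((resp.headers.get("Memento-Datetime", when), gdp))

for taken_at, value in series:
    print(taken_at, value)
```

Knowing which archived versions exist, and where, is of course the hard part; that is exactly what the time maps and aggregators just described are meant to answer.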
So before long, using a bottom-up approach, you end up with at least one aggregator that has access to all of the information available to memento clients. The format for these time maps is quite simple. Although ORE uses RDF, we've introduced a new, much simpler style which is based on the link header format that has recently been standardised at the IETF, so ingesting this information is not difficult.

OK, that's great: that means we can find the mementos themselves, the archived copies, but how do we find the time maps that describe them? One way that we've been thinking about is to use an Atom feed that, instead of saying here are all of the mementos, says here are the time maps that describe them. This would allow applications to remain in sync with archives; an aggregator, for example, could use it to redirect clients to the right place. In this case we'd have one Atom entry for each original resource, and for each system that hosts mementos the entry would provide a link to the time map that describes whereabouts each archived copy is and the time at which it represents the original resource. At the IIPC meeting in The Hague in May we're hoping to discuss this.

We've also been thinking about a batch process for discovering mementos directly, and one way we've thought about doing that is to use the robots.txt file, which is the file that servers use to convey their policies to crawling robots such as the Internet Archive's. What we'd like to do there is add a directive that supports the discovery of mementos, even if they're not directly within that server. So, for example, there's a whole bunch of conference sites for JCDL, the Joint Conference on Digital Libraries: JCDL 2011, JCDL 2010, JCDL 2001 and so on. All of those sites are archived on jcdl.org, the main conference website. What we'd like is some way for jcdl.org to say, this resource is a memento of jcdl2002.org's homepage. From there, a memento crawler can simply say, OK, I know that's a memento; if it has all of the appropriate links, you can then find the time maps, crawl the site, find all of the mementos and so forth. Hopefully from there you'd also be able to find an Atom feed which has the links to the time maps for all of the resources, and all of a sudden you have access to all of the information, again just by following your nose, starting from one point and going out along semantically typed links. We're going to try to promote this method via the internet draft, and hopefully we won't get too much backlash for adding another directive to robots.txt.

OK, memento and branding. This is a topic which David Rosenthal, also here, has brought to our attention, and we think it's critical to understand: if all of an archive's resources are being displayed outside its own context, how does the archive get funding, and how will it promote itself? If you think about this CNN homepage, there's HTML, there's a whole bunch of CSS, a whole bunch of images, maybe some videos, and they could all be recovered from different web archives. This poses a serious branding challenge for us. Currently, if you go to the Wayback Machine at the Internet Archive, you get this nice header at the top, which is the Wayback Machine's branding for this particular page, in this case February 12th of 2010. That applies to the whole page, because all of those resources came from that archive.
However, because memento is distributed and each resource stands alone, the client might decide that the most appropriate copy of this photo is actually in Archive-It or in the UK Web Archive, and there's no branding associated with it. Even worse, we might get branding from the Wayback Machine that has nothing to do with this photo; they might not even have a copy of it, just because of the way the crawler works. The same goes for these images, or this DoubleClick ad, and so on. This is one of the things we're intending to research; we don't have any great new insights on it yet. Some of the proposed approaches have been to have a sidebar in which resources have their archives advertised, or to have mouseovers that say that a resource comes from a particular web archive. I guess Herbert would have done this talk a lot more slowly than I have, but on to the last one.

Alternative web archiving strategies. Currently web archives are all crawl-based: you have a robot that crawls the web, records what it sees and keeps it in an archive. What we've been looking at is a process called transactional archiving, in which the server cooperates with the archive to make sure that every single version of a resource gets stored. Here's the current crawl-based approach. You have the crawler, which could be Heritrix from the Internet Archive. It sends a request to a web server for a particular resource and gets back that resource in the response, which it stores somewhere in its very large storage array. This is great. However, the crawler only sees what it sees. It doesn't see resources that other browsers might see, it can be deceived by cloaking, and it can have geolocation applied to it, so it would only ever get the US version of a page and never see the French version, for example.

So what we're looking at is a server-side transactional archive, in which the web server cooperates by sending each version of the resource to a submission system that puts it into the storage array. It does this whenever any old browser triggers it. So someone in Paris comes to the site and gets the French version, and it stores the French version, as opposed to the Internet Archive crawlers, which will only ever see the American version. This gives us a complete history of the changes to the site. There are some examples of this already; we're not the first people to do it. There's TTApache, which is open source, and pageVault and Vignette WebCapture, which are commercial products.

We've developed an Apache module called mod_ta, for transactional archiving, which captures the URI, the headers and the body of the response and then, at the same time as returning it to the client, also POSTs it to the transactional archive's submission URI. So here's mod_ta: when the request comes into the web server, it copies the information over to our web submission system, which stores the content in the file system and the metadata in a Berkeley DB. At the same time, Apache sends the response back to the browser and the real user. We've been running a transactional archive on that pic of the day site and also on mementoweb.org, so if you happened to hit that during the talk, you've populated a new copy of one of those pages in our file system. So that's the storage side.
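As an illustration of the idea (not of mod_ta itself, which is an Apache module), here is a small Python sketch of server-side transactional archiving: a WSGI middleware that lets the response go back to the client while also filing a copy away with its URI, headers and capture time. The names and storage layout are purely illustrative.

```python
# Hedged sketch of server-side transactional archiving as WSGI middleware.
# Every response served to a real user is also written to an "archive",
# keyed by URI and capture time. Purely illustrative, not mod_ta.
import datetime
import hashlib
import os

ARCHIVE_DIR = "/tmp/transactional-archive"  # illustrative storage location

def archive_response(uri, status, headers, body):
    """File one captured version: body on disk, a few metadata fields alongside."""
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    captured_at = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    name = hashlib.sha1(uri.encode()).hexdigest() + "-" + captured_at
    with open(os.path.join(ARCHIVE_DIR, name + ".body"), "wb") as f:
        f.write(body)
    with open(os.path.join(ARCHIVE_DIR, name + ".meta"), "w") as f:
        f.write("uri: %s\nstatus: %s\ncaptured: %s\nheaders: %r\n"
                % (uri, status, captured_at, headers))

class TransactionalArchiveMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        captured = {}

        def capturing_start_response(status, headers, exc_info=None):
            captured["status"], captured["headers"] = status, headers
            return start_response(status, headers, exc_info)

        # Let the wrapped application build the response as usual.
        body = b"".join(self.app(environ, capturing_start_response))

        # Archive exactly what this user was served, then return it unchanged.
        uri = environ.get("PATH_INFO", "/")
        archive_response(uri, captured.get("status"), captured.get("headers"), body)
        return [body]
```

A real deployment would of course POST to a separate submission service and record more of the request context, but the shape is the same: the archive sees exactly what the user saw.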
On the access side, we of course natively support memento, and hence if you go to mementoweb.org with a memento client you can go back and see all of those previous versions. The archived content is currently immediately available, although in the release version we'll have the option of an embargo period, and we support export of WARC files, which is the standard format for archiving and transferring archived web resources. As for the development timeline, we're currently finalising some of the development at Los Alamos, such as the embargo periods and some of the configuration options, and this is being tested both at ODU and with some other partners. The submission and access parts are finalised, and as I said, the development focus is going to be on collection management and configuration. We hope to release a version of this later this year, probably in the fall; it'll be open source, and of course any contributions to that work would be appreciated. That's our update, and again I'd like to apologise for not being here; if there are any questions, I'd be very happy to answer them.

I'm wondering how memento deals with the challenge that sites aren't, as we were hearing this morning, static anymore. I'm thinking in terms of WordPress, for example. I could imagine creating a time gate in WordPress somehow that was aware of WordPress's own storage of revision history and presented something as it was at a certain time in the past. But WordPress isn't simply the content: it's also a skin that people have created, it's whatever other pieces are in a sidebar. It probably can't even present its own site, a few years down the road, the way it actually was at a point in time. So when you're thinking about developing memento awareness in something like a MediaWiki or a WordPress or tools of that sort, what does that really mean, and what would the creators of that kind of compatibility need to be aware of and do?
For MediaWiki we're pretty fortunate that they have a very good versioned history of almost everything. They don't have a good versioned history for the themes; the resources are available, I should say, but you can't handle them with just a plugin, you have to actually get in and modify the core code. As for WordPress, I don't think we've actually looked at it. We looked at Drupal and found that its underlying version management system is not as strong as MediaWiki's, and it reuses URIs in certain circumstances, which makes things slightly troubling. So for the themes and images, having time-stamped copies of them is very important. Instead of having them as part of WordPress's own content system, if they were in an external archive then, because memento is distributed, you could fetch the images and the CSS and so forth from that archive rather than directly from WordPress. I know what you're about to say: if you get in and modify the code, if you change the PHP, then of course you have another problem.

Well, the essence of themes and skinning is often code, and so it won't run: you may be running a future version that can no longer even run the theme, and it feels like you're back in the world where all you can rely on is the Internet Archive kind of snapshot of the day. It's very hard to imagine how we're going to get even systems with revision history to do this.

There are two ways that we're approaching this problem. The naive way is to say, well, it's the information that the user is probably after which is being preserved, so whether or not it's got current ads from the current day rather than from 2007, they're probably not going to care, as long as they've got the information from the page from 2007. However, if you're actually interested in how 2007 sites looked, because you're an HCI person, then you really don't care about the content, so that's not a particularly satisfying answer for everyone. The way we're looking at it at the moment, and this is very much research and not necessarily going to work, is to use copy-on-write file systems, which preserve snapshots of every single change to the file system. You can do that directory by directory, so you could have all of your PHP files, for example, in snapshotting directories, and on top of that roll your data back, or roll your PHP back, I should say, to the version at the time you wanted. Then you really would have the process archived as well as the content.

That was very good. Thank you, Rob, you've made fabulous progress since we last talked about this, and I'm glad I could set you up so well; it seems we're really thinking about some of the same problems and there's much to go forward on. Two questions, one kind of short and one longer, if I may. The shorter one is on your slide about branding, and I'm wondering why you're thinking of that as a branding issue, as opposed to a question of identity and persistence, so that people can know exactly where each piece came from.

Right, they're very much interrelated problems. It also comes down to trust at the same time. If the user is not informed as to where each resource that makes up the view comes from, then there are three problems. You lose the branding for the archive, in which case they don't get any name recognition; they can't then sell themselves as, we're the Internet Archive, or we're the British Library, we get so many hits and people know our name. So suppose you just had a video archive or an image archive; a CSS archive is the worst of all possible situations, because you never see the CSS itself, it's a process applied on top of the HTML, so why would anyone give or use money to run a CSS archive, even though it would be particularly valuable? The user then also can't see where a resource comes from, which means they have no information available to know whether they can trust the representation that's being delivered to them. If this image came from an untrustworthy source that replaced every single image of Bill Clinton with Mickey Mouse, the browser would still say, hey, this is the closest-in-time image for this particular original resource, I should display it. At the moment everything is fine, because the archives that exist are trustworthy; however, in the future, if this were really to take off, you might have spammers creating their own memento archives of ads just to be inserted into pages, and in that case we really need to be able to know whether or not we can trust where an image is coming from; we need to know the archive. Did that answer the question?

It did. Let me pose the other one just to frame it, but then give David his time, and if you have time to come back to mine, fine. It's on the slide that you had before, with your three little charts about reproducibility dropping early. If we have time to talk about that in more depth, I hope we will, because I think you're exactly right here, and I'm thinking about streaming data, which is harder and harder to reconstruct, whether it's sensor networks, FLUXNET, or all kinds of observational climate data with these characteristics. The reproducibility problem is an extremely high bar: even though that word is scattered around the data management plans, it's a really, really high bar to think about. So let me hold that and leave it to David, and if we have time, come back to it. Please do.

I just wanted to point out that this isn't a hypothetical issue. In a January post on my blog I used the example of a journal called Graft, which went out of publication and is now claimed to be available through four separate archives. There's Portico: Portico claims to have it, and they do in fact have it, but you need to pay them in order to get it, which most people won't be able to do. Secondly, the KB has it, but you can only get at it if you're at the KB. Thirdly, the Internet Archive claims to have it, and what they have is the front page, the abstracts of all the articles, and, for every full-text article, a login page, because this was a subscription journal and they couldn't collect the actual text. And then there's CLOCKSS, which has triggered this content: it's open access, and all the content is there. So in this framework, where it's entirely up to an archive what it claims to have, three chances out of four, when you follow your nose, you're going to end up at something that isn't actually the content you want. This is a serious problem: we're going to have, effectively, the analogue of the search engine optimisation wars in archiving, and the branding issue is just the tip of this iceberg, the leading edge of a big problem. But let's get back to Christine's point.

So I just wanted to build on what both of you have talked about for a minute, going back to a couple of themes that Christine mentioned around marrying policy and incentives. I think one of the big challenges
in a lot of the archives that maintain large collections of web data, for example, is that it's not just an issue of branding. It goes back to the constraints that that particular institutional entity has to operate under, and one of the results is not only the problem of whether the archive can collect material in the first place, crossing barriers, but also whether it can then be re-displayed. In some cases image content may not appear in a public archive because it happens to be individually excluded based on an active robots.txt, and for archives like the Internet Archive we actually check those exclusions and cache them in 24-hour windows. So if there's an active exclusion, which may not even come from the same site owner, we have to respect those constraints in order to maintain the level of openness that we're able to provide. National libraries follow completely different standards, and every individual archive is beholden to a different set of policies and legislation.

One idea that we've actually been trying to explore within the community, and I'd be interested in hearing some thoughts about this, is to borrow from the internet security industry and its concepts around anonymous data collaboratives. Conceptually, you enable individual actors to contribute information about locations on the web, but you do that in an anonymous way, so that you have a repository that is, in aggregate, reflective of all the activity, and you have a little bit more information about what's going in and out. So imagine a context in which memento, as a research tool for the global community, could leverage that kind of data set and know where these various resources were located, even if we had to work in this imperfect context where some material couldn't be made publicly accessible and you'd have to be on site, or you'd have to present credentials as a validated researcher in a recognised academic context, and so on. It may be a stepping stone towards some of the quality and trust issues that both of you have articulated. Anyway, it's something we're all struggling with; we're trying to figure out how we contribute and how we maximise openness, because this open-versus-closed world is incredibly difficult to navigate, and memento only works if we can maximise the amount of information that's available for the collections of data sets of interest. Then we can get to the point where, hopefully, the content follows from there.

I have nothing much further to add on that topic, other than that, although in the current implementation all of the time gates are completely open, you could have time gates that were community-specific, where you had to authenticate to the time gate via username and password, via Shibboleth, via OpenID, that sort of thing. In that case, after you have established your credentials with the time gate, which is the resource that does the redirections, it could make better decisions for you as to where you would want to be taken, and the information that Chris was just talking about would be perfect for making those sorts of decisions. So, for example, the Austrian web archive can only be accessed when you're on site, although I understand that might be changing. If they had a policy where anyone currently in Austria could use it, as opposed to only people within their library, then you could imagine a time gate which people in the .at domain could use in order to access resources from the .at domain. So there are a couple of ways that memento may be able to help in that regard, using the sorts of information that Chris and David brought up.

I'm glad Chris brought the open-versus-closed issue in, because that's one of the policy flags, and I guess maybe the only other thing to pursue at this point is the degree to which memento may be able to respect policy flags as well. This is a big move of Creative Commons and Science Commons too, to try to mark data in ways such that you know what rights are attached to it. Talk about an intersection of technology, people and policy: that's really it, right there, that we've got to deal with.

And would you see that information being attached at the URI level, where for example in the HTTP response you'd get back a link to a Creative Commons licence saying that this particular resource has this licence, so that it can be respected, or at the collection level, where you'd say that all of these resources have this particular policy attached to them?

Well, you're far more the technical expert on that than I am. But let me take that back up to a larger question, and it's one that I've raised within the Board on Research Data and Information, and that I hope will be addressed at its August meeting at Berkeley with DataCite and some others; BRDI is doing part of that. The policy pressure has been on getting people to cite data properly, but my argument is that the problem starts much earlier in the lifecycle: people producing data have no real way of marking it in any package of appropriate granularity to decide what to cite. And even if the investigator says, okay, this is the right granularity, once it gets mixed up and mashed up in lots of other ways, whatever the unit is is going to look very different to the re-user. So do you need to deal with the citation problem at the point of origin, or farther down the line? Right now it's only being looked at farther down the line. We need to look at it at the point of origin if we're going to figure out whether it's going to be marked up with policy and with persistent identifiers, so that memento, ORE, the Internet Archive, anybody else, can pick it up and handle it properly.

Alright, that was a fantastic conversation. Are there any other questions or comments? Well, with that, did you have another question, or was that more towards the policy side? Well, I'd like other people to have an opportunity; surely there are other questions around here, and you and I can continue that later. Do you want to go back to that slide with your three cute little charts on it? Could you explain to us what you mean by reproducibility there? I think that would help.

So by reproducibility here I mean that, given the data and the process that you applied to the data, you end up with the same result. Given a data set and some code, some process that runs on top of that data set, if you can reapply the process to the data at a later point in time and you end up with the same result, then your experiment is reproducible. So here, as the data changes, you can apply the same process to the changing data and you'll end up with different results, so it's not reproducible, because you don't have access to the prior state of the data. If you have a static data set and a changing process, you have the same problem. And when you add those two together, those two pictures, you end up with a curve that tends towards zero.

That's actually a fairly weak definition of reproducibility; that's a low bar, because you're just putting a black box in there and saying, given these inputs, do I get the same outputs? Recreating the experiment, recreating the study, what we're thinking about in terms of science, is way harder than that.

I'm only talking, for the purposes of this weekend, about reproducing access to computationally processable resources, as opposed to the entire scientific domain. The computational side is mostly what Victoria Stodden talks about in reproducibility, but that's a small subset of the actual research that people would like to make reproducible.

That clarity helped. Again, when NSF says reproducibility in data management plans, it means something much grander, and when the OAIS guidelines say mandatory, all in bold, independently understandable, that's a much higher bar than what you've got here, and this is really hard.

Right. We only hope to provide some small part of the solution towards this small part of a very large problem. Thank you for making me clarify that.

OK, well, thank you all very much.