Okay, I'm Dave Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) program at Stanford University Libraries. As with all my talks, you don't need to take notes or ask for the slides, because the text of the talk with links to all the sources will go up on my blog shortly. One of the preservation networks that the LOCKSS program runs is the CLOCKSS Archive, a large dark archive of e-journals and e-books. We operate it under contract to a not-for-profit organization jointly run by publishers and libraries. Earlier this year we completed a more than year-long process that resulted in the CLOCKSS Archive being certified against the Trustworthy Repositories Audit and Certification (TRAC) criteria by CRL. We equaled the previous highest score and received the first ever perfect score for technology. At documents.clockss.org you'll find all the non-confidential material on which the auditors based their assessment, and on my blog you'll find posts announcing the certification, describing the whole process we went through, discussing the lessons we learned, and describing how you can yourself run the demos that we put on for the auditors. Although CRL's certification was to TRAC, the documents include a finding aid structured according to ISO 16363, the official ISO standard that is superseding TRAC. If you look at the finding aid, or at the voluminous ISO 16363 documents, you'll see that many of the criteria are concerned with economic sustainability. Among the confidential material the auditors requested were budgets for the last three years and projections for the next two showing revenue and expenses; we actually gave them five-year projections. This is an area where we had a good story to tell. The LOCKSS program got started with grant funds from the NSF, the Andrew W. Mellon Foundation and Sun Microsystems. But grant funding isn't a sustainable basis for long-term preservation. In 2005 the Mellon Foundation gave us a two-year grant which we had to match, and at the end of which we had to be off grant funding completely. The LOCKSS software is free and open source; the LOCKSS team charges for support and services. Achieving this economic sustainability (we've been off grant funding for more than seven years now) has required a consistent focus on minimizing the cost of every aspect of our operation. Because the LOCKSS system's "lots of copies" approach trades using more disk space for using less of other resources, such as lawyers, I've been researching the cost of storage in particular for some years. In what follows I want to look at the big picture of digital preservation costs and their implications, in three sections: the current situation, cost trends, and what we can do.

So, the current situation. How well are we doing at the task of digital preservation? Attempts have been made to measure the probability that content is preserved in some areas: e-journals, e-theses and the surface web. In 2010 the ARL reported that the median research library received about 80,000 serials. Stanford's numbers support this. The Keepers Registry, across its eight reporting repositories, reports just over 21,000 titles as "preserved" and about 10,500 as "in progress". Thus under 40% of the median research library's serials are at any stage of preservation. Luis Faria and his co-authors compared information extracted from journal publishers' websites with the Keepers Registry and concluded that more than 50% of all journal titles and 50% of all articles were not in the registry.
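A quick back-of-the-envelope check of that under-40% figure, using the round numbers just quoted (an illustrative sketch, not exact registry counts):

```python
# Back-of-the-envelope check of the serials coverage figure quoted above.
median_arl_serials = 80_000    # serials received by the median ARL research library (2010)
keepers_preserved = 21_000     # titles the Keepers Registry reports as "preserved"
keepers_in_progress = 10_500   # titles reported as "in progress"

covered = keepers_preserved + keepers_in_progress
fraction = covered / median_arl_serials
print(f"Titles at some stage of preservation: {covered:,}")
print(f"Fraction of the median library's serials: {fraction:.0%}")  # ~39%, i.e. under 40%
```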
The Hiberlink project studied the links in 46,000 US e-theses and determined that about 50% of the linked-to content was preserved in at least one web archive. Scott Ainsworth and his co-authors tried to estimate the probability that a publicly visible URI was preserved, as a proxy for the question "how much of the web is archived?". They generated lists of random URIs using several different techniques, including sending random words to search engines and random strings to the bit.ly URL-shortening service. Then they tried to access each URI from the live web, and they used Memento to ask the major web archives whether they had at least one copy of that URI. The results are somewhat difficult to interpret, but for their two more random samples they report that URIs from search-engine sampling have about a two-thirds chance of being preserved, and bit.ly URIs about one-third.

So are we preserving half the stuff that should be preserved? Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.

First, the assessment isn't risk-adjusted. As regards the scholarly literature, librarians, who are concerned with post-cancellation access rather than with preserving the record of scholarship, have directed resources to subscription rather than open-access content, and within the subscription category to the output of large rather than small publishers. Thus they've driven resources towards the content at low risk of loss and away from the content at high risk of loss. Preserving Elsevier's content makes it look like a huge part of the record is safe, because Elsevier publishes a huge part of the record, but Elsevier's content is not at any significant risk of loss and is at very low risk of cancellation. So what have those resources achieved for future readers? Very little. As regards web content, the more links to a page, the more likely the crawlers are to find it, and thus, other things like robots.txt being equal, the more likely it is to be preserved, but equally the less it is at risk of loss.

Second, the assessment isn't adjusted for difficulty. A similar problem of risk aversion is manifest in the way different formats are given different levels of preservation. Resources are devoted to the formats that are easy to migrate, but precisely because they're easy to migrate they're at low risk of obsolescence. The same effect occurs in the negotiations needed to obtain permission to preserve copyright content. Negotiating once with a large publisher takes a certain amount of work but gains a large amount of very low-risk content; negotiating with a small publisher takes not much less work and gains a small amount of high-risk content. Similarly, the web content that's preserved is the content that's easier to find and collect; smaller, less-linked websites are less likely to survive. So harvesting the low-hanging fruit directs resources away from the content at risk of loss.

Oops, some notes have gone missing. Sorry about this, they've stuck together; I got coffee on them.

So, third, the assessment is backwards-looking. As regards scholarly communication, it looks at the traditional forms: books, theses, papers.
It ignores not merely the published data, but also all the more modern modes of communication such as workflows, source code repositories and social media, which are mostly at higher risk of loss than the traditional forms because they lack well-established business models, and which are more difficult to preserve because the legal framework is unclear (for example, for research data) and because the content is either much larger, or much more dynamic, or both. And as regards the web, it looks only at the traditional document-centric surface web, rather than including the newer dynamic forms of web content and the deep web.

Fourth, the assessment is likely to suffer measurement bias. The scholarly literature figures are based on bibliographic metadata, which is notoriously noisy, and apparently the metadata was not deduplicated, so there's some amount of double counting going on. And Scott Ainsworth's paper is full of caveats about the measurement bias involved.

As Cliff Lynch pointed out in his summing-up of the 2014 IDCC conference, scholarly literature and the surface web are genres of content for which the denominator of the fraction being preserved, the total amount of the genre of content, is fairly well known, even if it's difficult to measure the numerator, the amount being preserved. For many other important genres even the denominator is becoming hard to estimate, as the web enables a whole variety of different distribution channels. Books used to be published through well-defined channels that assigned ISBNs, so you could count ISBNs, but now e-books can appear anywhere on the web and most of them don't have any ISBN. YouTube contains a lot of things that used to be movies, and it hosts a lot of music publishing these days; an example is Pomplamoose, a YouTube phenomenon created by a couple of Stanford grads.

Of course, what's worth preserving is a judgment call, but clearly even purists who wish to preserve only stuff that future scholars will absolutely, undoubtedly require access to would be hard pressed to claim that half of that stuff is preserved. So overall it's clear that we're preserving much less than half the stuff that we should be preserving. What can we do to preserve the rest of it? Well, we can do nothing, in which case we needn't worry about bit rot, format obsolescence and all the other risks, because each of those only loses a few percent; the reason why more than 50 percent of the stuff won't make it to future readers is that we can't afford to preserve it. Or we can double the budget for digital preservation. This is so not going to happen; we'll be lucky to sustain current funding levels. Or we can more than halve the cost per unit content. Doing so requires a radical rethink of our preservation processes and technology, and such a radical rethink requires understanding where the costs go in our current preservation methodology and how they can be funded.

As an engineer, I'm used to using rules of thumb. The one I use to summarize most of the research into past costs of preservation is that ingest takes about one half of the lifetime cost, preservation takes about one third, and access takes about one sixth. On this basis one would think that the most important thing to do would be to reduce the cost of ingest. It's important, but it's not as important as you might think. The reason is that ingest is a one-time, up-front cost; as such it's relatively easy to fund.
In principle, research grants, author page charges, submission fees and other techniques can transfer the cost of ingest to the originator of the content, and thereby motivate them to explore the many ways in which ingest costs can be reduced. But preservation and dissemination costs continue for the life of the data, forever. Funding a stream of unpredictable payments stretching into the indefinite future is hard. Reductions in preservation and dissemination costs will therefore have a much bigger effect on sustainability than equivalent reductions in ingest costs.

So, on to cost trends, starting with the cost trends for preservation. Kryder's Law held for three decades, which is an astonishing feat of exponential growth: for 30 years disk storage prices dropped 30 to 40% a year. This had two consequences. The first is that if you could afford to store the data for a few years, the cost of storing it for the rest of time could be ignored, because of course Kryder's Law would continue forever. The second is that as the data got older, access to it was expected to become less frequent, so the cost of access in the long term could be ignored. But can we continue to ignore these problems?

Well, something that goes on for 30 years gets built into people's model of the world, but as Randall Munroe points out, in the real world exponential curves cannot continue forever; they're always the first part of an S-curve. This graph from Preeti Gupta of UC Santa Cruz plots the cost per gigabyte of disk drives against time, and we can see that in 2010 Kryder's Law abruptly stopped. In 2011 the floods in Thailand destroyed 40% of the world's capacity to build disk drives, and prices doubled. Earlier this year they finally got back to 2010 levels. Industry projections are for no more than 10 to 20% a year going forward; those are the red lines on the graph. This means that disk is now about seven times as expensive as it would have been expected to be had the pre-2010 trend continued, and in 2020 it will be between 100 and 300 times as expensive as people would have expected up to 2010.

These are big numbers, but do they matter? After all, preservation is only about one third of the cost, and only about one third of that is media cost. Our models of the economics of long-term storage compute the endowment, which is the amount of money that, deposited with the data and invested at interest, would fund its preservation forever. This graph, which is from my initial, rather crude prototype economic model, is based on hardware costs from Backblaze and running costs from the San Diego Supercomputer Center (which are much higher than Backblaze's) and from Google. It plots the endowment needed for three copies of a 117TB dataset to have a 95% probability of not running out of money in 100 years against the Kryder rate, the annual percentage drop in dollars per gigabyte. The different curves represent policies of keeping the drives one, two, three, four or five years. Up to 2010 we were in the flat part of the graph, where the endowment is low and doesn't depend much on the exact Kryder rate. This is the environment in which everybody believed that long-term storage was effectively free. But suppose the Kryder rate were to drop below 20% a year: we would be in the steep part of the graph, where the endowment is both much higher and strongly dependent on the exact Kryder rate. We don't actually need to suppose; Preeti's graph and the industry projections show that now, and for the foreseeable future, we are in the steep part of the graph.
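To make the shape of that endowment curve concrete, here is a minimal, deterministic sketch of the kind of calculation involved. It is not the prototype model itself (the real model is stochastic and has many more cost components); the unit cost, running-cost ratio and replacement policy below are illustrative assumptions only.

```python
def endowment(kryder_rate, years=100, interest=0.02,
              media_cost_per_tb=40.0, media_life=4, running_ratio=1.0):
    """
    Rough endowment (money needed up front, invested at `interest`)
    to keep one terabyte stored for `years` years.

    kryder_rate       : annual fractional drop in $/TB of media (0.30 = 30%/yr)
    media_cost_per_tb : illustrative purchase cost of media today, $/TB
    media_life        : years before the media are replaced
    running_ratio     : annual running cost as a fraction of the media price
    """
    total = 0.0
    for year in range(years):
        price = media_cost_per_tb * (1.0 - kryder_rate) ** year
        cost = price * running_ratio            # running costs every year
        if year % media_life == 0:              # buy new media every `media_life` years
            cost += price
        total += cost / (1.0 + interest) ** year  # discount back to today
    return total

# Flat vs. steep parts of the curve: endowment vs. Kryder rate
for rate in (0.40, 0.30, 0.20, 0.10, 0.05, 0.0):
    print(f"Kryder rate {rate:>4.0%}: endowment ~ ${endowment(rate):8.0f} per TB")
```

Even with these toy numbers, the output shows the same qualitative behavior as the graph: above roughly 30% a year the endowment barely moves, while below 20% a year it climbs steeply and becomes very sensitive to the exact rate.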
So what happened to slow Kryder's Law? There are a lot of factors; we outlined many of them in a paper for UNESCO's Memory of the World conference. But briefly, both the disk and tape markets have consolidated to a couple of vendors, turning what used to be a low-margin, competitive business into one with much better margins. Each successive technology generation, for disk and for tape, requires a bigger investment in manufacturing, so it requires bigger margins, which drives consolidation; and the technology needs to stay in the market longer to earn back the investment, which reduces the rate of technological progress.

Thanks to aggressive marketing, it's commonly believed that the cloud solves this problem. Unfortunately, cloud storage is actually made of the same kind of disks that local storage is, and it's subject to the same slowing of the rate at which it's getting cheaper. In fact, when all costs are taken into account, cloud storage is not cheaper for long-term preservation than doing it yourself, once you get to a reasonable scale. You don't have to get to the same scale as Amazon, because Amazon is keeping most of the economies of scale for Amazon; what you have to do is get big enough to gain the economies of scale that Amazon is leaving on the table for its customers. Cloud storage really is cheaper if your demand is spiky, but digital preservation is the canonical base-load application. As I mentioned in the previous session, you may think that cloud storage is a competitive market. In fact, it's dominated by Amazon, so in all these discussions about the hypothetical cloud vendor you can substitute Amazon. When Google recently started to get serious about competing in this market, they pointed out that Amazon's margins may have been minimal when Amazon started, but by then they were extortionate. Notice that, yes, Google triggered a major price drop from Amazon, but it's a one-time event. It was a signal to Amazon that they couldn't have the market to themselves, and to smaller players, such as Rackspace, that they were going to find it very difficult to compete in the future. In fact, commercial cloud storage is a trap, because it's free to put data into a cloud service such as Amazon's S3, but it costs to get it out. For example, getting your data out of Amazon's Glacier without paying an arm and a leg takes two years. If you commit to the cloud as long-term storage, you have two choices: you either keep a copy of everything outside the cloud (in other words, you don't actually commit to the cloud), or you stay with your original choice of provider no matter how much they raise the rates.

Unrealistic expectations that we can store the vastly increased amounts of data projected by consultants such as IDC within current budgets place currently preserved content at considerable risk of economic failure. Here's a graph that illustrates the looming crisis in long-term storage costs. The red line is the Kryder rate at IHS iSuppli's projection of 20% a year. The blue line is the IT budget, which computereconomics.com estimates to be increasing at 2% a year. The green line is the annual cost of storing the data accumulated since year zero at the 60% a year growth rate projected by IDC, all relative to the value in the first year. Ten years from now, storing all the accumulated data will cost over 20 times as much as it does this year.
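Here's a minimal sketch of the kind of projection behind that graph. The three rates are the ones just quoted; everything else (annual arrival of data, media-only costing) is a simplifying assumption of mine.

```python
# Relative annual cost of storing all data accumulated since year 0.
kryder_rate = 0.20      # $/GB falls 20%/yr (IHS iSuppli projection)
data_growth = 0.60      # new data grows 60%/yr (IDC projection)
budget_growth = 0.02    # IT budgets grow 2%/yr (Computer Economics estimate)

def relative_storage_cost(year):
    # Data accumulated by `year`, relative to the data stored in year 0,
    # times the $/GB price relative to year 0.
    accumulated = sum((1 + data_growth) ** y for y in range(year + 1))
    price = (1 - kryder_rate) ** year
    return accumulated * price

base = relative_storage_cost(0)
for year in (0, 5, 10):
    cost_factor = relative_storage_cost(year) / base
    budget_factor = (1 + budget_growth) ** year
    print(f"year {year:2d}: storage cost x{cost_factor:5.1f}, budget x{budget_factor:4.2f}")
# At year 10 the storage cost factor is well over 20x while the budget has
# grown only ~1.2x, which is what drives the budget-share point that follows.
```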
So if storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. If you're in the digital preservation business, storage is already way more than 5% of your IT budget. So even if you're not growing 60% a year, you may still have a serious problem. What I'm saying is that even if you're growing much less than 60% a year, if you're growing at anything more than the Kryder rate you're in trouble, and the Kryder rate may be as low as 10%. I think there are very few digital preservation systems whose content is growing only 10% a year.

The storage part isn't the only part that's going to be much higher than people expect; access will be too. The Blue Ribbon Task Force pointed out that the only real justification for preservation is to provide access. With research data in particular, this can be a real difficulty, because the value of the data, and thus the demand for access to it, may not be evident for a long time. My favorite example is that Shang Dynasty astronomers in China inscribed eclipse observations on animal bones. About 3,200 years later, researchers used these records to estimate that the accumulated clock error was about seven hours, and from this they derived a value for the viscosity of the Earth's mantle as the continents rebound from the ice age.

In most cases so far, the cost of an access to an individual item has been small enough that archives have not charged the reader. Research into past access patterns to archived data shows that access was rare, sparse, and mostly for integrity checking. But the advent of big-data techniques means that, going forward, scholars increasingly don't want to access a few individual items in a collection; they want to ask questions of the collection as a whole. For example, the Library of Congress announced that it was collecting the entire Twitter feed, and almost immediately had 400-odd requests from scholars for access to this collection. The scholars weren't interested in accessing individual tweets; they were interested in mining information from the entire history of tweets. Unfortunately, the most the Library could do with the feed was to write two copies of it to tape. There's no way they could afford the computing infrastructure to do the data mining. We can get some idea of how expensive this is by comparing Amazon's S3, which is designed for data-mining-type access, with Amazon's Glacier, which is designed for traditional archival access. S3 is currently at least two and a half times as expensive; until recently it was five and a half times.

So, ingest. Almost everyone agrees that ingest is the big cost element in preservation. So where does the money go? The two main cost drivers appear to be the real world and metadata. In the real world, it's natural that cost per unit content increases through time, for two reasons. First, the content that's easy to ingest gets ingested first, so over time the difficulty of ingestion increases. Second, digital technology evolves rapidly, and mostly by adding complexity. For example, the early web was a collection of linked static documents; its language was HTML, and it was reasonably easy to collect and preserve. The language of today's web is JavaScript, and much of the content you see is dynamic. This is much harder to ingest. In order to find the links, much of the collected content now needs to be executed as well as simply being parsed.
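To make concrete what "executing the content" to find the links involves, here's a small illustrative sketch using Selenium driving a headless Chrome browser. The tool choice and the placeholder URL are mine; this is not a description of how Heritrix or any production crawler actually works. A plain HTML parser sees only the links present in the static markup; a browser engine also sees links that the page's JavaScript inserts after it loads.

```python
# Sketch: extract links from a page after its JavaScript has run.
# Requires the selenium package and a Chrome/chromedriver install.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          # no visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.org/")      # placeholder URL
    # At this point the browser has built the DOM and CSSOM and executed
    # the page's scripts, so dynamically-inserted links are present.
    links = {a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")
             if a.get_attribute("href")}
    for url in sorted(links):
        print(url)
finally:
    driver.quit()
```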
And this is already significantly increasing the cost of web harvesting, both because executing the content is computationally a lot more expensive and because elaborate defences are needed to protect the crawler against the possibility that the content might be malign. It's worth noting, however, that the very first US website, from 1991, which Stanford's web archiving team has just restored, was in fact dynamic content, because it was a front end to a database. The days when a single generic crawler could collect pretty much everything of interest are gone. Future harvesting will require more and more custom, tailored crawling, such as we need to collect subscription journals and books in the LOCKSS program. This per-site custom work is expensive in staff time, so the cost of ingest seems doomed to increase. Worse, the W3C's mandating of DRM for HTML5 means that the ingest cost for much of the web's content will become infinite, because it simply won't be legal to ingest it. And metadata in the real world is widely known to be of poor quality, both the format and the bibliographic kinds. Efforts to improve the quality are expensive because they're mostly manual, and inevitably reducing entropy after it's been generated is a lot more expensive than not generating it in the first place.

OK, so all three phases are going to get more expensive, and we're already preserving less than half the content that needs preservation. We need to change our processes to greatly reduce the cost per unit content. So what can we do?

As far as preservation is concerned, it's often assumed that because it's possible to store and copy data perfectly, only perfect data preservation is acceptable. There are two problems with this expectation. To illustrate the first problem, let's examine the technical problem of storing data in its most abstract form. Since 2007 I've been using the example of a petabyte for a century. Think about a black box into which you put a petabyte, and out of which, a century later, you take a petabyte. Inside the box there can be whatever you want: as much redundancy as you want, whatever media you choose, whatever anti-entropy protocols you like. You want to have a 50 percent chance that every bit in the petabyte is the same when you take it out as it was when it went in. Now consider each bit in that petabyte as being like a radioactive atom, subject to a random process that flips it with very low probability per unit time. You've just defined a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment about how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you're never going to know that your system is reliable enough, Murphy's Law will guarantee that it isn't.

Amazon's state-of-the-art storage system has a design goal of an annual probability of loss of a data object of 10 to the minus 11. If the average object is 10 kilobytes, the bit half-life that implies is somewhere around a million years, which is way too short to meet the requirement, but it's still really hard to measure. And note that that 10 to the minus 11 is a design goal, not the measured performance of the system. There's a lot of research into the actual performance of storage systems at scale, and it all shows them underperforming expectations based on the specifications of the media.
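If you want to check the bit half-life figure in that thought experiment, the arithmetic is straightforward. This assumes independent bit flips following simple exponential decay, which is the point of the abstraction, not a claim about how real media fail:

```python
import math

PETABYTE_BITS = 8e15          # 10^15 bytes = 8 x 10^15 bits
YEARS = 100                   # keep the petabyte for a century
TARGET = 0.5                  # 50% chance that *no* bit has flipped
AGE_OF_UNIVERSE = 1.38e10     # years, approximately

# Each bit must survive the century with probability p, where
# p ** PETABYTE_BITS = TARGET.  Modelling flips as exponential decay,
# p = exp(-lam * YEARS), and the half-life is ln(2) / lam.
lam = -math.log(TARGET) / (PETABYTE_BITS * YEARS)   # per-bit flips per year
half_life = math.log(2) / lam                        # years

print(f"required bit half-life: {half_life:.2e} years")            # ~8e17 years
print(f"about {half_life / AGE_OF_UNIVERSE:,.0f} times "            # ~60 million
      f"the age of the universe")
```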
Why do real storage systems underperform the specifications of their media? Because real storage systems are large, complex systems subject to correlated failures that are very hard to model. Worse, the threats against which they have to defend their contents are diverse and almost impossible to model. Nine years ago we documented the threat model we use for the LOCKSS system. We observed that most discussion of digital preservation focused on these threats, but that the experience of the operators of large data storage facilities was that the significant causes of data loss were quite different.

To illustrate the second problem, consider that building systems to defend against all these threats combined is expensive, and can't ever be perfectly effective. So we have to resign ourselves to the fact that stuff is going to get lost. This has always been true of archives; it shouldn't be a surprise, and it's subject to the law of diminishing returns. Coming back to economics, how much should we spend reducing the probability of loss? Consider two storage systems with the same budget over a decade. One has a loss rate of zero; the other is half as expensive per unit content but loses one percent of its content each year. Clearly, you would say that the cheaper system has an unacceptable loss rate. But each year the cheaper system stores twice as much, and loses one percent of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost, and after 30 years it has preserved more than five times as much at the same cost. Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need?

The canonical example of this is the Internet Archive's web collection. Ingest by crawling the web is, as everyone can understand, a lossy process. The Archive's storage system loses a small proportion of its content every year. Access via the Wayback Machine is not totally reliable. Yet for US users archive.org is currently the 150th most visited site on the web, whereas the Library of Congress, which is a lot more careful, is the 1,519th. For UK users archive.org is the 131st most visited site, whereas the British Library is the 2,744th. Why is this? Because the Archive's collection was always a series of samples of the web, the losses merely add a small amount of random noise to the samples, but the samples are so huge that the noise is insignificant. This isn't actually something about the Internet Archive; it's something about very large collections of data. They always have noise in them, and questions asked of them are always fundamentally statistical in nature. The benefit of doubling the size of the sample, in this case and in many others, vastly outweighs the cost of a small amount of extra noise. In this case, more really is a lot better.

So unrealistic expectations for how well data can be preserved make the best be the enemy of the good. We spend money reducing even further the small probability of even the smallest loss of data; that money could instead preserve vast amounts of additional data, albeit at a slightly higher risk of loss. Within the next decade all the current popular storage media (disk, tape, flash) will be up against extremely hard technological barriers. A disruption of the storage market is inevitable. We should work to ensure that the needs of long-term data storage will influence the result of that disruption.
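Going back to the two-systems comparison a moment ago, the decade figure is easy to reproduce. A minimal sketch, assuming the 1% loss is applied each year to everything held, including that year's new content:

```python
# Two systems with the same annual budget. System A stores X per year and
# never loses anything. System B is half the cost per unit content, so it
# stores 2X per year, but loses 1% of its accumulated holdings each year.
X = 1.0
loss_rate = 0.01
years = 10

a_holdings = 0.0
b_holdings = 0.0
for _ in range(years):
    a_holdings += X
    b_holdings = (b_holdings + 2 * X) * (1 - loss_rate)

print(f"after {years} years: A = {a_holdings:.2f}, B = {b_holdings:.2f}")
print(f"B has preserved {b_holdings / a_holdings:.2f} times as much")   # ~1.89x
```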
We should pay particular attention to the work underway at Facebook and elsewhere that uses techniques such as erasure coding, geographic diversity and custom hardware based on mostly spun-down storage to achieve major cost savings for cold data at scale. Every few months there's another press release announcing that some new quasi-immortal medium, such as fused silica glass or stone DVDs, has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Long-lived media are inherently more expensive and they're a niche market, so they lack economies of scale. Seagate, for example, could easily make disks with an archival life of 25 years or so, but they did a study of the market for them and discovered that no one would pay the relatively small additional cost. The fundamental problem is that long-lived media only make sense if the Kryder rate is very low. Even if the rate is only 10% a year, after 10 years you could store the same data in a third of the space. Since space in the data center, or even at Iron Mountain, isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30% a year, after a decade you could store 30 times as much data in the same space.

The reason that long-lived media are such an attractive idea is that they suggest you can be lazy and design a system that ignores the possibility of failures. You can't. Media failures are only one of many, many threats to stored data, but they're the only one that long-lived media address. And long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we've seen, current media are many orders of magnitude too unreliable for the task ahead. And even if you could ignore failures, it wouldn't make economic sense: as Brian Wilson, CTO of Backblaze, points out, in their long-term storage environment double the reliability is worth a one-tenth of one percent cost increase. Moral of the story: design for failure and buy cheap components.

OK, so, cutting dissemination costs. The real problem with dissemination in the future is that scholars are used to having free access to library collections and research data, but what scholars now want to do with archived data is so expensive that they must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, the data-mining infrastructure at the archive must have enough performance for the peak demand, but will likely be idle most of the time, which increases the cost for individual scholars, and a charging mechanism is needed to pay for this infrastructure. Fortunately, because the scholars' access patterns are spiky, the cloud provides both suitable infrastructure and a charging mechanism. For smaller collections which are public, Amazon provides free public datasets: Amazon stores a copy of the data at no charge, and charges scholars accessing the data for the computation rather than charging the owner of the data for the storage. Even for large and non-public collections it may be possible to use Amazon. Suppose that, in addition to keeping the two archive copies of the Twitter feed on tape, the Library of Congress kept one copy in S3's Reduced Redundancy Storage, the cheap version of S3, simply to enable researchers to access it. For this year that would have cost about $4,100 a month, or about $50,000 for the year.
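A quick check of those numbers, and of what charging them back to the 400-odd requesters mentioned earlier would look like (simple arithmetic on the figures just quoted):

```python
monthly_storage = 4_100          # approximate S3 Reduced Redundancy charge, $/month
researchers = 400                # the roughly 400 initial requests for access

annual_storage = monthly_storage * 12
per_researcher = annual_storage / researchers

print(f"annual storage cost: ${annual_storage:,.0f}")                   # ~$49,200, i.e. about $50K
print(f"charged back per researcher: ${per_researcher:,.0f} per year")  # ~$123
```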
Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges. Because the data transfers would be internal to Amazon, there would be no bandwidth charges. The storage charges could be borne by the Library or charged back to the researchers; if they were charged back, the 400 initial requesters would each need to pay about $125 for a year's access to the collection, which is not an unreasonable charge. And because the Library's preservation copy isn't in the cloud, they aren't locked in to Amazon as a supplier, nor to this strategy for access. So in the near term, separating access and preservation copies in this way is a promising way not so much to reduce the cost of access as to fund it more realistically, by transferring it from the archive to the user. In the longer term, architectural changes to preservation systems that closely integrate limited amounts of computation into the storage fabric have the potential for significant cost reductions in both preservation and dissemination. There are encouraging early signs that the storage industry is thinking along these lines.

So, cutting ingest costs. There are two parts to ingest: the content and the metadata. As I've said, the evolution of the web that poses problems for preservation also poses problems for search engines like Google. They used to parse the HTML of a page into its Document Object Model (DOM) in order to find the links to follow and the text to index. Now they also have to construct the CSS Object Model (CSSOM), including executing the JavaScript, and combine the DOM and the CSSOM into the render tree to find the words in their context, which is something Google cares about. Preservation crawlers such as Heritrix used to construct the DOM in order to find the links, and then preserve the HTML. Now they too have to construct the CSSOM, execute the JavaScript, and so on. It might be worth investigating whether preserving a representation of the render tree, rather than the HTML, CSS, JavaScript and all the other components of the page as separate files, would reduce costs.

It's also becoming clear that there's much important content that's either too big, too dynamic, too proprietary or too encumbered by DRM for ingestion into an archive to be either feasible or affordable. In these cases, where we simply can't ingest it, preserving it in place may be the best we can do: creating a legal framework in which the owner of the dataset commits, for some consideration such as a tax advantage, to preserve their data and to allow scholars some suitable access. Of course, since all the data will be under a single institution's control, it will be a lot more vulnerable than we would like, but this type of arrangement is better than nothing, and not ingesting the content is certainly a lot cheaper than the alternative.

Metadata is currently considered essential for preservation. For example, of the 52 criteria in ISO 16363 Section 4, 29 of them, more than half, are metadata-related. Creating and validating metadata is expensive. Manually creating it at the scale at which we operate is simply impractical. Extracting metadata from the content scales better, but it's still expensive, since considerable per-site work is needed and, as far as format metadata is concerned, it's computationally expensive to generate and validate. And in both cases the extracted metadata is sufficiently noisy to impair its overall usefulness.
So we need less metadata, so we can have more data, and two questions need to be asked. The first is: when is the metadata required? We had some very interesting discussions at the Preservation at Scale workshop contrasting the pipelines of Portico and the CLOCKSS Archive, which ingest much of the same content. Portico's pipeline is much more expensive because it extracts, generates and validates metadata during the ingest process. CLOCKSS, because it is a dark archive with no need to make the content immediately available, implements all its metadata operations as background tasks which can be performed as resources are available. The other question is: how important is the metadata to the task of preservation? Generating metadata because it's possible, or because it looks good in voluminous reports, is all too common. Format metadata is often considered essential to preservation, but if format obsolescence isn't happening, or if it turns out that emulation rather than format migration is the preferred solution, it's a waste of resources. And if the reason for validating the formats of incoming content using error-prone tools is to reject allegedly non-conforming content, it's actually counterproductive, because the majority of content in formats such as HTML and PDF fails validation but renders perfectly well.

Resources should be devoted to avoiding spilling the milk rather than to cleaning it up. For example, given how much the academic community spends on the services publishers allegedly provide in the way of improving the quality of their publications, it's an outrage that even major publishers cannot spell their own names consistently, cannot format DOIs correctly, get authors' names wrong, and so on. The alternative is to accept that metadata correct enough to rely on is impossible, downgrade its importance to that of a hint (and the hint can be very useful), and stop wasting so much resource on it. One of the reasons that full-text search dominates bibliographic search is that it handles the messiness of the real world much better.

So, attempts have been made, for various kinds of digital content, to measure the probability of preservation, and the consensus is about 50%. Thus the rate of loss to future readers from content never being preserved will vastly exceed that from all other causes, such as bit rot and format obsolescence. This raises two questions. First, will persisting with current preservation technologies improve the odds of preservation? At each stage of the preservation process, current projections of cost per unit content are higher than they were a few years ago, and projections for future preservation budgets are at best no higher, so clearly the answer is no. Second, if not, what changes are needed to improve the odds? At each stage of the preservation process we need to at least halve the cost per unit content. I've set out some ideas; others will have different ideas, but the need for major cost reductions needs to be the focus of discussion and development of digital preservation technology and processes. Unfortunately, any way of making preservation cheaper can be spun as doing worse preservation. Jeff Rothenberg's Future Perfect 2012 keynote speech is an excellent example of this spin in action. And even if we make large cost reductions, institutions have to decide to use them, and no one ever got fired for choosing IBM.
We live in a marketplace of competing preservation solutions. A very significant part of the cost of both the not-for-profit systems, such as CLOCKSS and Portico, and the commercial products, such as Preservica, is the cost of marketing and sales. For example, TRAC certification is in effect a marketing check-off item these days; the cost of the process CLOCKSS went through to obtain this check-off item was well in excess of 10% of the annual budget. So making the trade-off of preserving more stuff using worse preservation would need a mutual non-aggression marketing pact. Unfortunately, the pact would be unstable, because the first product to defect and sell itself as better preservation than those other, inferior systems would win. And so private interests work against the public interest in preserving more stuff.

To sum up, we need to talk about major cost reductions. The basis for this conversation must be more and better cost data. I'm on the advisory board of the EU's 4C project, the Collaboration to Clarify the Costs of Curation. They're addressing the need for more and better cost data by setting up the Curation Costs Exchange. Please go there and submit whatever cost data you can come up with for your own curation operations. We need much better data to understand the problem well enough to achieve the cost reductions that we need. Thank you, and I've left plenty of time for questions.