Let me welcome you to our spring 2009 meeting, everyone. I'm Cliff Lynch, the director of CNI, for those of you I've not had an opportunity to meet yet. And I am absolutely delighted that you are all here. I hope that your travels have not been too difficult. I know that getting places is not always as easy as we'd like it to be, but I'm glad that you all prevailed in a major way here. And I really hope that the next day and a half is going to make you feel like it was worth every minute of that time, and then some. We've got a wonderful program, I think, for you.

It's my really sincere pleasure to introduce David Rosenthal as our opening plenary speaker. I've known David for quite a long time now. He's well known in our community for his work on the LOCKSS program at Stanford. He actually has done a tremendous amount of stuff in the computer industry going back over 20 years. He was an early person at Sun Microsystems, worked on the Andrew Project at Carnegie Mellon. He was very early at NVIDIA, which builds high-performance graphics chips, which are probably in most of your laptops, at least if you have fairly recent ones, and has also been involved with a variety of other things at Stanford.

David has been doing something really important for all of us for the last few years, which is, I think, stepping back and trying to think critically about our assumptions about digital preservation. Without stealing any of his thunder, hopefully, I wanna say that in the last few years, particularly, I've become very uneasy with some of the sort of enshrined assumptions in this area, some of which were laid down by very good and insightful work, but work that was done 10 or 15 years ago, and that really was done by people grounded in the experiences of an era that's quite different from the one we're dealing in now, an era that perhaps was much more characterized by digital incunabula, rather than the sort of increasingly ubiquitous digital material that we're facing now at almost incomprehensible scale. David has gone back systematically and asked, are these assumptions right? Are they wrong? If they're wrong, what are more accurate assumptions? Where do those assumptions lead us? And I think that it's very, very important for us to engage this kind of thinking as we plan our production digital preservation strategies and digital preservation programs; we've gotta get it right, because our society depends on it. So I invite you to listen to David's reflections on this and to think hard about them. I have been fortunate enough to be able to preview some of this material on a couple of different occasions, and I will tell you, for whatever it's worth, it's certainly changed some of my thinking about digital preservation. I won't promise that it will change yours, but I will promise that whatever opinions you hold on it after hearing David will be far better grounded and far better informed. Please join me in welcoming David Rosenthal.

Thank you, Cliff. It's an honor to be invited to give this plenary. I'm gonna present a contrarian view of digital preservation, one I hope a lot of you will disagree with. I won't be talking about individual systems, but rather about problems that face, and are shared by, all our current digital preservation systems. There's a lot to say. So to leave enough time for questions, I'll be covering a lot of ground quickly and I'll be omitting a lot of points of detail.
CNI is gonna be putting these slides up on the web, and there'll shortly, probably later this week, be a long post on my blog providing commentary and links to the source materials for what follows.

So about six weeks ago, I was sitting in a room very much like this, applauding at the File and Storage Technologies conference as my friend Kirk McKusick got an award from the IEEE. The award celebrated his 30-year stewardship of the UNIX file system. Kirk pointed out in his acceptance speech how over those 30 years the file system had grown from about 12,000 lines of code to 55,000, and how these changes had greatly improved the performance and reliability of the file system. After all, performance and reliability are what the audience at the FAST conference cares about. But for this audience, I'd like to point out something else about this evolution. Through all this time, 30 years in which disks have grown bigger by a factor of about a million, there have been no incompatible changes to the API or to the on-disk format. The current code could read a disk that was written 30 years ago, in the unlikely event that the disk had survived 30 years in working order. The programs that I was writing for UNIX at that time still compile and run as they ever did. It sounds like we should be giving Kirk an award for digital preservation as well, but I'm sure he'd disclaim the credit for this long-term stability. It's simply the result of sound engineering practices for infrastructure software. The benefits of making a change in the system must be balanced against the costs of making it. Incompatible changes to widely used software impose costs on every user. If there are many users, these costs, aggregated, will overwhelm any possible benefit. This is especially true when the benefits, even if large, accrue to only a few users.

So public attention was first drawn to the problems of digital preservation by Jeff Rothenberg's seminal article in the January 1995 Scientific American. As he wrote this about the rapid, incompatible evolution of digital documents, Kirk's file system was already 16 years old with no incompatible changes. So notice the meme that Jeff is propagating: that in the digital world, incompatibility is like gravity, an inevitable, unending force of nature that can't be understood or resisted. The best we can do is to adapt to it. Well, I'm an engineer with almost 40 years' experience of building systems and an entrepreneur with a three-for-three record of successfully IPOing companies. I just don't recognize this meme. Incompatibility isn't inevitable. It's a choice someone made. I make that choice very, very reluctantly, after carefully assessing the costs and benefits, and only as a last resort. The assessment almost always indicates that it's worth paying almost any cost to avoid incompatibility. Kirk clearly agrees with me. In fact, there's a name for the process that creates incompatibility. It's reinventing the wheel. So why in 1995 did Jeff think that incompatibility was inevitable? If he were writing today, would he still think that? If not, what causes incompatibility, and are the same causes operating today as they were in 1995?

So this talk is in three parts. I'll first review Jeff's 50-year look forward from 1995, looking at what he predicted and why, and the state of the IT market that made his predictions plausible.
I'll then look at what happened in the 14-plus years since, asking what the impacts of Jeff's article were, what happened to the IT market in the intervening time, and what the effects were on Jeff's predictions. And then I'll grab the prophetic mantle away from Jeff and bravely attempt to predict the future, thus providing someone with a ready-made talk for the spring 2023 CNI meeting. The first two parts of the story are a good news story. Like Sondheim's Into the Woods, the happy ending comes in the middle. The third part is not so happy, but it concludes with a set of practical steps we can take to dispel some of the gloom.

Okay, we'll start with ancient history, which is everything before 1995, which for me represents two thirds of my time with computers. So Jeff's a really good communicator and he based his article on a scenario that made it very easy to understand. His descendants in 2045 come across a CD that contains the location of Jeff's fortune, and he asks what obstacles they need to overcome to find the loot. He identified three of them. There's media degradation: in 50 years the bits on the CD might have suffered bit rot and no longer be readable. There's media obsolescence: in 50 years his descendants might not be able to lay their hands on a CD drive. And there's format obsolescence: in 50 years the software needed to render the unrotted bits that the CD drive read from the disc might no longer be available. The first two of these were easy to explain and to defend against, by regularly migrating the bits from older to newer media. The third, format obsolescence, needed a lot more explanation and it dominates the article. Jeff identified two fundamentally different approaches to defending against format obsolescence. These are format migration and emulation. Jeff didn't approve of format migration at all, because in his view each of the repeated migrations between formats that would be necessary would inevitably lose some data, some information about how the document was to be rendered. He did approve of emulation, but he added an important caveat. He pointed out that specifications for the outdated hardware would be needed. The engineers he envisaged creating, at some time in the future, emulators for the old hardware, capable of running the preserved software that would render the preserved documents, would need these specifications. And the specifications would, he assumed, themselves be digital documents that needed to be preserved. Thereby deftly reducing the emulation strategy to a previously unsolved problem.

Summarizing Jeff's view, we get a pretty bleak picture of the future for digital documents. They survive on offline media like CDs. The bits evaporate off these CDs over time. The readers for them wear out, get lost and can't be replaced. The document formats, the hardware and the software are all proprietary, non-standard, and change frequently in incompatible ways. Two words: desktop publishing. It's really obvious when you read Jeff's article that the way the document got onto his CD was via a desktop publishing system. If you think back to 1995, desktop publishing was all the rage, but it's important to point out that the publishing medium they were talking about was paper. The only design goal for the Word and WordPerfect files was to save the state of the word processor so that you could resume editing the document at some later time, before you finally committed it to paper.
And these formats were thus exclusively the property of applications, and other applications reading those formats was a threat to the business model. Unfortunately, people started emailing these files to each other. They just got there quicker, people could edit them, send them back and so on. And so people started keeping the files rather than the paper.

The big surprise looking back at the IT landscape in 1995 was how different it really was. There were many CPU architectures in use. The PC was in the midst of a major architectural transition from the old ISA bus to the new PCI bus. Microsoft hadn't shipped Windows 95 yet. So in those days, although it was actually possible to connect a PC to the internet, very few humans had actually succeeded in doing it. The Mac was running System 7, which was starting to look seriously long in the tooth. Linux was barely functional. The whole application space was fragmented. It wasn't at all clear whether Microsoft Word or WordPerfect was gonna win.

So let's look at what's happened since Jeff's article. The first thing to say, especially since I'm criticizing Jeff's article, is that it was extraordinarily important in drawing public attention to the problem and especially in helping to get funding started. The Mellon Foundation deserves particular credit for funding a wide range of important initiatives including LOCKSS. In the US, the NSF, the Library of Congress and the National Archives all started funding major projects. Elsewhere, the KB and the Australian National Archives deserve particular note. The result is we now have systems in production use that are based on both the strategies that Jeff described. It is, however, a little odd that many more of them use format migration than use emulation. We should also observe that the Internet Archive started in 1996 preserving the web for posterity, and it used neither of these techniques.

So the web is obviously the most important development in IT since Jeff's article. It was less than six months later that Stanford's HighWire Press put the Journal of Biological Chemistry online and pioneered the shift of academic publishing from paper to the web. It was less than a year later that Netcraft started measuring the size of the web. This is their current graph. It shows the growth in host names and in really active websites. Host names in blue, active websites in red. As Jeff was writing, it was far from clear that the web would be the success it has been. Even four years later, when Jeff was revising his article in 1999, it was still before the explosive growth of the web. You can see that the flat line starts really shooting up sometime in the middle of 99. So in Jeff's vision, documents survived on offline media. They came online occasionally for manipulation or copying. And copyability was something that was extrinsic to the storage medium. You had to find a separate reader for the media in order to do the copy. Now, if something's worth keeping, it's online. One of my great regrets is that Rob Pike at Google made this a very snappy quote and I was so taken with it that I totally forgot to write it down. And so I don't have the right words, but the thought is due to Rob Pike. Offline backups are now something temporary; you can see this because typically they're recycled on a cycle. They're not the way the document actually survives. And copyability is something that's intrinsic to the online medium. No one cares what the actual medium storing the bits is.
Some of your laptops have rotating disks. Some of your laptops have flash memory. You don't care. And in fact, if you opened them up and looked, you probably wouldn't be able to tell just from looking at it what's inside that box. What matters is that it conforms to the access protocols. That it implements one of the current standard interfaces, such as SATA or USB or the internet protocol. And that what's on the disk is a file system, such as Kirk's.

Okay, another major development since Jeff's article was that Microsoft definitively lost the war against its users. In the beginning, Microsoft made money by selling software such as Word, not to actual users, but to PC vendors and companies. But they were so successful that pretty soon they saturated the market. How are they gonna keep going? Well, like the record industry, they realized that what they needed to do was to sell essentially the same product to people who'd already bought it at least once. And they managed to do this by gratuitously introducing incompatibility. The new version of the product, which came with the new computers that were entering the organization, by default wrote documents in a format that the old version couldn't understand. And so in order to read these documents, people with old computers in the organization had to upgrade, that is, to buy the same product again. Even though the new product didn't actually provide any capabilities that they really needed. They just wanted to read the document. Eventually, the unhappiness of the actual users forced the important actual customers, that is, the companies, to protest at the unnecessary costs that they were forced to bear. This and a whole lot of other abuses led to an antitrust probe. And in 1994, a consent decree settled it, which proved to be completely ineffective. But the user revolt eventually led to a standard for document formats being adopted by ISO, that's ODF. As usual, Microsoft gamed the standards process, reinvented the wheel, and came up with an incompatible rival, that's OOXML. But in doing so, they conceded the basic point. Their software had become so much a part of the infrastructure that they could no longer pollute it with gratuitous incompatibility.

Widespread experience with the rapid pace of new incompatible formats misled Jeff, along with many others. But notice that Microsoft's strategy depended on introducing new formats, not on removing the old formats. If the new computers couldn't read the documents that the old computers produced, no one would buy the new computers. So last year, Microsoft announced that the forthcoming service pack for Office would remove the support for a whole lot of really, really old formats. And instantly, they were bathed in flames. It took less than a week for them to back down. So if it was ever the case that it was easy for Microsoft to remove old formats, that's clearly no longer the case. The real difference here is that documents are no longer the property of a program. They're content to be published. The hypothetical business model that involves charging users to upgrade their browser so that it will no longer read old content is self-evidently stupid. The browser's free, most of the content is free, and the Office business model is dead. And you can see this, because Microsoft is doing everything it possibly can now to move Office to be a service accessible on the web, not an application that you have to upgrade all the time.
The goal of publishing is to reach as many readers as possible. Gratuitous incompatibility is a way to limit the number of readers. Notice the unpopularity of clue-impaired publishers like the British government who put up Internet Explorer-only websites. It turns out that the goal of reaching as many readers as possible means that it's important that many different programs be able to read and understand what you publish. Perhaps the most critical one is not Internet Explorer or Firefox. It's the Googlebot. If Google doesn't index your stuff, no one will read it. This means that even if you control the program that creates the documents, you don't control all the programs that you care about reading them.

So another important development, which can't really be said to have happened since Jeff's article, is virtual machines. Both hardware and software virtual machines have a really long history. But except in the mainframe space, hardware virtualization was not a mainstream technique. In fact, in 1995, it wasn't even possible to virtualize a PC because Intel hadn't quite gotten around to putting all the necessary bells and whistles into the CPU architecture yet. But now, virtual machines for both software and hardware are universal. There are a lot of Macs around here; running virtual machines is now just a standard capability of the Macintosh. In particular, there are industrial-strength, open source emulations of the entire PC architecture. Jeff was right about emulation. He was just wrong about how it would come about. Preservation had nothing to do with the rise of virtual machine technology.

Similarly, in 1995, open source was a small niche. Linux barely worked. The lawsuit between Unix System Laboratories and BSDI that freed the Unix source code was settled in 1993, but under the condition that the settlement be kept secret. So although there were rumors about what was in it, nobody really knew at that stage. Now, open source is a basic strategy for all but two of the big IT companies. The importance of this for preservation is hard to overestimate. First, open source renderers have been developed for nearly all the major formats and most of the minor ones. There are even open source renderers for most DRM-protected formats. Their legality is often challenged, but presumably no one would bother challenging it if they weren't effective at rendering the content. Second, open source is extremely well preserved. The source code's in ASCII, so it's immune from format obsolescence. It's stored in source code control systems capable of rebuilding the entire software stack, the operating system kernel, all the utilities, all the applications and everything, as it was at any point in its history. And this isn't some special preservation technology. It's an everyday necessity of the development process. The source code control systems are all carefully backed up by the development teams and others. Third, for the same reason that there's no flag day on the internet, open source is very hostile to incompatibility. Kirk is special among open source programmers only because of his longevity. All the developers realize that they have no way to force the other developers to adapt to an incompatible change they want to make. So if they make it, it's very unlikely that that change will get into the open source genome. Thus, it's very hard to imagine a scenario, any scenario, in which a format with an open source renderer would ever go obsolete. In effect, open source renderers are executable format metadata.
So with 20/20 hindsight, we can see that Jeff was wrong in every particular. Documents survive online, on the web. Migration is inherent. Formats are standard and application independent. Proprietary formats get open source renderers. Format obsolescence never happens.

So it turns out that a book that explained why Jeff's predictions went awry was published the year before Jeff published his article. W. Brian Arthur was an economist at Stanford and at the Santa Fe Institute. And he explained the behavior of technology markets. And I've tried to illustrate this with this graph that I cooked up, which shows installed base over time for several competing products. You can see in the early stages of the development of the market, there are a number of competing products. And after a while, one of them takes off and rapidly comes to dominate the market, and the others gradually die on the vine. And Arthur's explanation for this is that almost all IT markets have increasing returns to scale. That's the economist's name for it. It's more often called network effects. Metcalfe's law: the value of a network goes as the square of the number of nodes. And IT markets have path dependence, which is the economist's name for the butterfly wing effect. There are many players early in the evolution of a market. For some random reason, one of them gets a little bit bigger than any of the others. At that point, the network effects take over and boost that one's market share against all the others. And what this means is that IT markets are inherently subject to capture. Microsoft and Intel are the canonical examples of this. And what happens to captured markets is that change slows down dramatically. We've recently had a wonderful example of this with Vista. And the reason is that the dominant player in the market has every incentive to fear radical change. So history misled Jeff into overestimating change. Go back to the graph for a moment. You'll see this arrow shows the stage at which the losers are losing not just market share, but actual installed base to the winner. Users are switching from the doomed loser to the winner in the marketplace. And at this stage in the development of the market, the winner has a great incentive to make this switch as easy as possible. For example, by providing really good converters from the loser's format to the winner's, but not vice versa. But from the preservation point of view, that's what you wanna have happen.

Okay, so Jeff being wrong in all his detailed predictions is actually good news. What it says is that getting collections to survive is not nearly as hard as we thought it was. There's every reason to believe that if you just collect and keep the bits, particularly if those bits were on the web, all will be well for a long time. There are free open source tools to do the job of collecting the stuff and keeping it. Just go do it. The major reason why content's getting lost is that no one is bothering to collect it. So we're two thirds of the way through and that was the happy ending. Looking ahead is very risky, as Jeff's example shows. Despite the risk, I'm gonna continue into the future, even though, as with Sondheim, things get darker as we proceed. So looking back at our experience of actually getting digital preservation systems into production use, we can see that the real problems were quite different from those Jeff identified. We can sum them up as scale, cost, and intellectual property.
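The market-capture dynamic Arthur describes, a small random early lead amplified by network effects until one player dominates, is easy to see in a toy simulation. The sketch below is purely illustrative and not from the talk; the quadratic weighting is a crude stand-in for Metcalfe's law, and the product count and user count are arbitrary assumptions.

```python
import random

def simulate(products=4, new_users=10_000):
    """Each new user picks a product with probability proportional to the
    square of its installed base (a crude stand-in for Metcalfe's law)."""
    base = [1] * products                     # every product starts with one token user
    for _ in range(new_users):
        weights = [b * b for b in base]       # network value grows as nodes squared
        winner = random.choices(range(products), weights=weights)[0]
        base[winner] += 1
    return base

for run in range(3):
    shares = simulate()
    total = sum(shares)
    print([f"{100 * s / total:.1f}%" for s in shares])
# Typical result: one product ends up with almost the whole installed base,
# and which one it is varies from run to run; that is the path dependence.
```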
So you can think of Jeff as looking at micro-level preservation. He was asking how a single document on a single CD can survive into the future. And fortunately, developments in IT made that far easier than Jeff thought it would be. But the same developments made the macro-level digital preservation problem enormously larger than Jeff could ever have anticipated. The digital resources that we take for granted every day are now at an industrial scale. They come from data centers the size of car factories that use power the way aluminum smelters do. The storage cost alone of keeping just one copy of just one important database online for just one year can be over a million dollars. The standard, you know, hand-curating of digital documents one at a time is simply inadequate to society's need for digital preservation. We need processes by which curators can preserve huge collections of data in a day's work.

So the big lesson from Google is that there's more value in the connections between the documents than in the individual documents themselves. Ripping documents from their context in order to preserve them destroys this value. We need to preserve documents in such a way that the connections, the links between them, get preserved too. And this is another instance of Metcalfe's law. An isolated document is a network of one node. Of course, Google's other big lesson is that operating at the scale needed is really, really expensive. We don't have good numbers for what it costs to run the processes that we think of as digital preservation at the kind of scale that we need. I'm going to use two extremes to get some kind of a ballpark estimate of what we're talking about. I'm going to use archive.org and Portico. I should stress that both these systems are well engineered to meet their goals using their chosen techniques. I'm not criticizing them. I'm simply using them as a way of placing bounds on what it costs to preserve stuff at a large scale.

So the Internet Archive contains something over two petabytes. It's growing at about a quarter of a petabyte a year. This is a big operation. You can come to some kind of understanding of how big it is when you realize that Google collects the web monthly and then throws it away, but archive.org tries to collect the web monthly and keep it. I believe they currently have two copies and another one coming up. Their burn rate is somewhere between 10 and 14 million dollars a year, which translates to roughly $5 to $7 per gigabyte per year. So Portico aims to preserve the academic literature. It's much smaller. It's about 50 terabytes, growing perhaps five terabytes a year. They're still working to ingest it, but let's assume that they've done it. Their burn rate appears to be between six and eight million a year, which translates to well over $100 per gigabyte per year. Roughly 20 times as expensive. Of course, archive.org should be a lot cheaper than Portico because it isn't doing all that preservation stuff. You know, I'm someone who's published research on data loss in preservation systems using data collected at the Internet Archive. So I can definitively state that better bit preservation than the archive was doing a couple of years ago is a requirement for good digital preservation. But they've recently switched to shipping containers full of ZFS servers and that should really help a lot. The cost question is important, and a factor of 20 is a really big number, because society needs to preserve a whole lot of data.
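A minimal back-of-the-envelope check of those two burn rates: the collection sizes and budget ranges are the approximate figures quoted above, and taking the budget midpoints is an assumption.

```python
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000

# Internet Archive: roughly 2 PB of content, burn rate of $10-14M a year.
ia_cost_per_gb = 12e6 / (2 * GB_PER_PB)        # midpoint budget, about $6/GB/year
# Portico: roughly 50 TB of content, burn rate of $6-8M a year.
portico_cost_per_gb = 7e6 / (50 * GB_PER_TB)   # midpoint budget, about $140/GB/year

print(f"Internet Archive: ~${ia_cost_per_gb:.0f} per GB per year")
print(f"Portico: ~${portico_cost_per_gb:.0f} per GB per year")
print(f"Ratio: ~{portico_cost_per_gb / ia_cost_per_gb:.0f}x")
```

The outputs land near the round numbers used in the talk: a few dollars per gigabyte per year for the Internet Archive, well over a hundred for Portico, and a factor of roughly twenty between them.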
We don't know how much data, but let's say it's an exabyte. That's only one byte in every 20,000 that will be created in 2011, let alone all the data that was created in the years up to 2011. With archive.org's cost structure, this would take five billion a year; with Portico's, it would take 100 billion a year. I don't think the world has five billion a year to spend on this. I'm pretty sure that no one thinks the world has 100 billion a year to spend on this.

So another big change brought by the web is that content, pretty much any content, now has a business model. Lawyers are extremely reluctant to risk the business model that's paying their fees. So they don't want you to have a copy of the content that's the basis for this business model. And the result of this was the Digital Millennium Copyright Act. And it means that you have really only two choices. You either get permission from the copyright owner or, like archive.org, you rely on the safe harbor provision. And the problem with the safe harbor provision is that it isn't nearly safe enough for preservation. Anyone can write the Internet Archive a letter claiming to own the copyright of some content. And because the archive doesn't have the resources necessary to contest every one of these letters, they simply have to take it down. Of course, one reason why Portico costs a lot more than archive.org is that they do have the lawyers to go talk to the publishers' lawyers and get permission to preserve their content. And this is a really expensive business. One hour of one lawyer is about five terabytes of disk. So 10 hours of one lawyer could buy enough disk to preserve the entire academic literature, which is what Portico's trying to do. An unfortunate effect is that this drives preservation resources in two directions. One of them is to content where the copyright is owned by the organization doing the preservation, because then you don't have to get the lawyers involved. And the other one is to content which has a low lawyer-hour-per-byte value. It may take an awful lot of lawyer hours to negotiate with a big publisher, Elsevier say, but the result is you get an enormous number of bytes. In our experience it's really easy to negotiate with a small publisher, but you only get a small number of bytes. And the result is that the lawyer-hour-per-byte number for the small publisher is way, way higher than it is for Elsevier. This effectively focuses preservation resources on content that's not really at risk, because it's the foundation of the business model of large publishers who are not going away. The average large publisher has been in business longer than the average library. For content that you publish, there is a way to avoid the hassles. Creative Commons licenses permit everything that you need to do for long-term preservation. It's really important that if you publish stuff, you do it with a Creative Commons license.

So we're facing a need for preservation that's vastly greater than we can afford with current preservation technologies. We need to think through what is needed, how we can reduce the costs, and how the costs we can't eliminate can be paid for. In order to do this, we need to figure out which of the things that we thought were problems aren't, based on our experience so far, and also to look ahead to find the things that we didn't realize were problems that will be. So we've already seen two things that have consumed a lot of attention and resources that turn out to be non-problems.
Any format with an open-source renderer is safe. If you want to preserve stuff in a format that doesn't have an open-source renderer, or if you're not happy with the quality of the rendering, the right answer isn't to collect format metadata so that, with luck, someone in the uncertain future will eventually solve your problem for you. It's to develop or fix the open-source renderer yourself right now. Similarly, metadata, in general, is much less of an issue than we thought. There are two ways of getting it. Getting it by hand is simply too expensive for the scale that we need to operate at. And getting it by running a program is feasible. At the scale we're talking about, actually running a program that scans the entire data gets quite expensive. But it's almost certainly cheaper to do that than to run the program now and save the metadata that it generates, because why would you believe that you won't be able to run the program again?

The advent of the web has transformed the very concept of a digital document. We've thought of preservation in a way that implies a static, isolated object that can be frozen and later thawed out and viewed on its own. That didn't even apply to web 0.9, which, as with all new media, started out by recapitulating the behavior of the medium that it was replacing. So reading the very early web was like reading a book, except there were links. Typically, when we thaw out a preserved web page, the links don't work. Web 1.0 made the first real change because it started inserting personalized advertisements. This meant that everybody who read the page saw something different. No one who's preserving web pages preserves the ads. It's like they have no value, but here are three cultural artifacts spanning over 100 years in which the ads play a really important role. Now, why is it that it's only important to preserve advertisements when they're not advertising any real product? So web 2.0 poses a really major problem for preservation. Pages are not merely different for every reader. They change all the time as you read them, so they're different for the same reader through time. So what exactly does it mean to preserve one of these pages? It's unique, it's dynamic, there's no canonical version of it to preserve. The things that are being published right now are services, they're not documents in any meaningful sense, and we don't know how to preserve services.

So let's look at some things that are clearly worth preserving. Future historians are definitely going to want to study the 2008 election, and to do that, they'll need preserved blogs, and not just the content of the blogs, but also all the YouTube videos of macaca moments and so on, and the pictures on the picture-sharing sites like Flickr and so on that they're pointing at. All of them preserved in such a way that the links between them keep working. The technical, legal, and scale obstacles to doing this are formidable. This was also the first election where campaign ads were inserted into online video games, and where election meetings took place in Second Life. Even if you could get the data for these games and worlds and invest in the big server farms that are needed to run them, these artifacts are dead without their community. How many of you remember Myst? It came out in 1993, right? It was this really beautiful virtual world that you explored, and after a while, you came to realize that you were the only person there.
And a little while later, you came to understand that the whole goal of the game was to figure out why you were the only person there. And Myst was a great game in its day, but there's no way that it would succeed against things like World of Warcraft or Second Life. Or again, many of the more interesting election sites mashed up data from, say, the political contribution databases with Google Earth. Without Google Earth, they just don't work. You know, is Google Earth forever? Maybe we need to preserve the virtual planet as well as preserving the real one.

So the 2008 preservation buzzword of the year was sustainability, because we came to understand that we couldn't afford to preserve even the stuff that we knew how to preserve. And that was before the economy melted down. Looking forward, we can see that there will be a lot more bytes to preserve, but also that preserving them will be a lot more difficult. And presumably, therefore, a lot more expensive. And there's another inconvenient truth. Bytes are a lot more vulnerable to disruptions in the money supply than paper is. They're kind of like divers in old-fashioned hard-hat diving suits, dependent on air being pumped down to them from the surface. We need to make preserved bytes a lot more like scuba divers that carry their air around with them in a tank that only needs to be refilled at intervals. Sort of the only way of giving bytes a fighting chance to survive hard times is to give them an air tank, which is to endow them. And endowing data raises the cost per byte a lot at the beginning of the process. It actually reduces it out into the future, but it shoves the demand for money towards the beginning of the process. And the result of this is going to be that fewer bytes are gonna get preserved, because you've just raised the price of doing it. But if we collect huge volumes of data, huge numbers of documents, and then have to throw them all away because we can't afford the storage charges when economic hard times return again, what have we really achieved?

So we're going to be short of funds, which means we need to triage what gets preserved. And traditionally, this has been collection development. And collection development's role in the digital world has been undercut by the whole switch from purchasing content to renting access to it. So if we're finding it hard to do collection development for digital content at all, how are we gonna figure out how to do it at the kind of scale that we're gonna need? To sum up, we can see that digital preservation faces some really difficult questions at all levels. What does it mean to preserve dynamic personalized content? How can we figure out what services the dynamic content depends on? And how can we figure out what services those services depend on? And so on, ad infinitum. Because in this mashed-up world, you still need to ask permission from everyone. And how are you even gonna find everyone you need to ask permission of? And above all, even if we get very creative at addressing cost reduction, how can we possibly afford the industrial infrastructure that's needed? Do we have to rent it from services like Amazon's S3?

So the speaker after Kirk's acceptance speech at the FAST conference was Alyssa Henry, who runs S3 for Amazon. And she gave a fascinating talk, with a lot of really important lessons. But during the talk, she said that S3 carefully monitors both availability and reliability of their service.
And she actually gave specific numbers for availability. And said that S3 offers a service level agreement with penalties if they don't make three nines availability. But there was something missing from her presentation, which I brought up tactlessly as the first question, which was that she didn't give any numbers for reliability. And she certainly wouldn't commit to any service level agreement about reliability. But from a preservation point of view, reliability, not availability, is what matters. So if the services that we need to buy from won't commit to a level of reliability, or even tell you what the level of reliability is, that's kind of a shaky basis on which to build preservation systems.

So we can't just throw up our hands in despair at all these really difficult problems. Society's transition to the digital medium is inevitable. It's basically happened. And we need to make sure that the digital medium replaces all of paper's functions in society, not just the ones that are easy to do in the digital world. Paper is a durable, write-once, tamper-evident medium. The digital medium as we currently use it has none of these attributes. Society has come to depend on the kind of fixed, durable, tamper-evident record that paper provides. To use just one example, the federal government is working hard to ensure that all government information is distributed through a single huge web server run by the executive, called FDsys. So in George Orwell's 1984, Winston Smith's job was to rewrite history to keep up with today's fast-changing ideologies. And FDsys would be Winston Smith's dream machine, point-and-click history rewriting. So I don't know where the campaign to get Carl Malamud appointed as the Public Printer has gone, but it would be well worth appointing him, because Carl understands this problem.

So to end up, what's to be done to dispel some of the gloom that I've been spreading since the happy ending? First, encourage everyone to pay attention to the good news that happened in the first half. It isn't nearly as difficult as we thought to collect and preserve content. Just go do it. And for anything that you publish, please use Creative Commons licenses. Even if you don't think it's worth preserving, Creative Commons licenses allow someone else to disagree with you without getting the lawyers involved. Second, let's focus on the fact that the open-source repositories like SourceForge contain the infrastructure for digital preservation. There are no technical, legal, or scale barriers to preserving it. Any institution can do it. Several institutions should do it. If those repositories go away, digital preservation has some really big problems. Third, let's make sure that what's in those repositories is as useful for digital preservation as we can possibly make it. Working to improve the set of renderers that are available for less popular formats, and working to improve the quality of the rendering of the open-source renderers for the popular formats, is the biggest bang for the buck programmers can provide for digital preservation. Lastly, we really need a significant research program addressing the problems facing digital preservation. Two issues really stand out, and neither of them is really the subject of any significant research at the moment. The first is, we need much cheaper and more reliable ways of storing large numbers of bits. That's a very, very difficult technical problem.
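To make the availability-versus-reliability distinction, and the difficulty of the storage problem, concrete, here is a small illustrative calculation. The one-petabyte, one-century and fifty-percent targets are assumptions chosen for illustration, not numbers from the talk.

```python
import math

# Availability: "three nines" bounds downtime, but says nothing about whether
# the bits you get back are the bits you put in.
downtime_hours = (1 - 0.999) * 24 * 365
print(f"99.9% availability allows ~{downtime_hours:.1f} hours of downtime per year")

# Reliability: suppose we want to keep 1 PB for 100 years with a 50% chance of
# losing no bit at all. How reliable does each bit have to be?
bits = 8 * 10**15       # 1 petabyte in bits
years = 100
target = 0.5            # probability of zero bit loss over the whole century
# We need (1 - p) ** (bits * years) >= target, where p is the annual per-bit
# loss probability; since p is tiny, ln(1 - p) is approximately -p.
p = -math.log(target) / (bits * years)
print(f"Required annual per-bit loss probability: below {p:.1e}")   # about 8.7e-19
```

Numbers on this order are what make cheap, reliable storage of large numbers of bits such a hard problem.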
I wrote a paper about this at the last iPRES meeting pointing out how difficult this really is. And we need to figure out ways of preserving the kind of dynamic, personalized content that's being published now, not just the static webpages that were just starting to be published as Jeff was writing his seminal article. So thank you for your attention, and I hope I've allowed plenty of time for questions.

There's a microphone, and if you could state your name and affiliation before the question.

I hope we've sparked some disagreement. Or maybe just that everybody is so depressed by the last part of the talk that you're slumped in your chairs thinking, oh my God. Okay, there is a question. No, that's good. Is that microphone on? No, but basically. I have an easy one for you to start you off. That's much better.

You talked about the market forces that are driving document formats, of which we have a lot, to a common format, so that preservation becomes kind of trivial. You get it for free. But I'm not seeing that in other sectors and other industries. So for example, in the CAD market, that competition still exists and is getting worse, not better. So the difficulty of preserving those formats is getting worse over time as more and more people adopt tools like that. So would you generalize what you've seen in the document world to other types of software like that? And can we forget about the problem and just assume that sooner or later it'll consolidate and normalize, or not?

Yeah, I actually guessed that you were gonna ask this question. Sort of, there are a number of responses. The first is that at some level, it just doesn't matter, because if you keep the bits for the data and you keep the bits for the program, you can run an emulation of today's PC and use that to do the rendering. So Jeff was right that emulation is a really important technique for avoiding format obsolescence and it really works well. And it works well because forces in the IT infrastructure market have realized that that capability is essential to running the kind of industrial-scale server farms that drive the web. So that capability is going to be maintained irrespective of preservation needs. So one answer is, who cares? You just emulate today's PC, you keep the bits for the operating system and the application and all the data, and everything will be fine. And I think that's a very powerful argument, and that's one of the areas where Jeff was really right. Then again, you can argue that the large-scale CAD systems are still a relatively immature market in Brian Arthur's sense, and that over time there will be a consolidation in that market. I think that's quite likely, but it will take a long time. Different markets have different timescales, and the consumer market tends to have a much, much faster timescale than industrial markets. But also, in another sense, it doesn't matter, again because what matters, as I pointed out, is not when new formats are introduced but when old formats are removed. And I think that, given the scale of the investments that their customers make in using these systems, these companies are going to find it even more difficult than Microsoft to remove support for old formats. And I know that, so my experience with CAD goes back into the 70s and stopped in the early 80s, but even in those days, manufacturers were removing support for old formats and they were getting grief for it.
I can only imagine what CATIA or whoever would encounter if they seriously tried to do that right now. I don't know if that's three adequate answers to your question. Yeah, thank you.

Hi, Jeremy Frumkin, University of Arizona. So first I wanna thank you for a very thought-provoking presentation. It leads to many, many questions, but I'll just ask this one. During your presentation you touched on, but didn't go into depth on, the topic of open data. But you did mention, as an example, mashup type of things with Google Earth or Google Maps and such. So it seems to me there are many challenges when you're talking about scientific applications, research applications, that themselves may be open source, may have open data, but interact with other critical sets of data and critical applications which are not open source and open data. And I think, at the scale we're at right now, the Google Maps mashup type of example is a great example of that sort of thing. What are your thoughts on those challenges, because that's an increasing problem of scale, and how we might be addressing those sorts of things?

Right, this is a very difficult problem, and I don't think the question of whether the services are themselves open source actually makes a lot of difference. To take the example and make it uniform: think about the way that scientific workflows are coming to be something that people who do experiments in silico mash up. Typically the data is available and the workflows are all open source, but they're still very fragile. And publishing one of these things in such a way that it's gonna last is very tricky. Talking to people who use them, they tend to break a lot, and this is a big problem. And it's very hard to see how to handle it, because as someone wanting to preserve these kinds of things, you have very little leverage with the people who actually have their hands on the services. So yeah, I agree, it's a very, very difficult and very important problem, and even in the case where everything is notionally open, it's still a big problem. And for a lot of things that we really want to preserve, the services are very far from open. It costs a lot of money to run something like Google Earth, or even much smaller services that people depend on. And even the government services: if you look at the way that the government publishes stuff on the web, they actually publish it in ways that, for example, typically hide the links to the actual content behind chunks of JavaScript and so on, to make it as hard as possible for you to crawl it and get copies into your own hands, even though that stuff is government information and should be available to you. And you can see that there's a lot of temptation for them to keep introducing gratuitous incompatibility into the services that they export, so that it's very hard for people outside to embarrass politicians by revealing the information that's in them. And this is really a huge problem, and I'm certainly not gonna stand up here and claim that I have any idea about how we're gonna solve it, but it's really important that we start doing the research to find the solutions to this problem.

Tim Cole of the University of Illinois.
I wanna return to the format issue, but I wanna thank you and agree with you that some of the big issues that have been neglected in too much attention to format migration have been things like the scale and the intellectual property and the service issues. But I would argue for a more nuanced approach on the format. As librarians, as preservationists, we occupy a space for the most part, most frequently, between publishers and the users. And you lay out the arguments very well as to why, when things are published, they're published in a format for which bit preservation will tend to suffice. But we also in many cases occupy a space directly between the content creator and the user. Think of university archives that preserve papers of faculty and researchers. Think of urban management libraries. In that context, content creators do not share the same motivations publishers do when they create materials. They are much more likely to adopt newer and incompatible technologies that they feel give them some small advantage in how they create new material. And in some cases that's trivial and doesn't matter. People presenting at this conference, many of them will create in Microsoft PowerPoint, which is a more changeable format, but then publish in a version of archival PDF when they actually put their things on the web. On the other hand, if you're talking to mathematicians or physicists that work with high levels of mathematics, they tend to create using TeX and LaTeX, often using very fragile and ephemeral macros, and leave it to the publishers of their materials to convert those to PDF. If all we preserve are the PDFs of those articles, we do a disservice, in the sense that the TeX and the LaTeX, properly preserved, have much more potential to support searching by mathematics, by function, for example, than the published PDF allows. So there are still, I think, going to be at least a subset of cases, maybe a predominantly small number of cases, but cases where recognizing that formats are troublesome and have to be migrated in some cases, or may have to be emulated at very high fidelity, will still be important in digital preservation for some select subset of materials.

Yes, okay, let's make this personal, because what you've been watching was PDF, and I generated it with OpenOffice, so the source file is ODF, but you shouldn't care about that. PDF is everything you need. But almost all my published papers were done in TeX and submitted to arXiv.org. But let's look at that process. I actually spend a lot of time making sure that the way it would get converted into PDF was correct. Not all authors are quite as concerned about that, perhaps. I'll get to that. There's a submission deadline, and what do you spend the day before the submission deadline doing? You spend it fighting with TeX to get down to the page limit. And that means that you need to be concerned about exactly where the line breaks happen and things. And I actually don't want anybody, well, arXiv.org actually has the TeX for all these things, but I'd much rather they didn't, and just had the PDF. The real point here is that, ah, I forgot, I'm terrible with names, the chemistry guy, he complains a lot about PDF, Peter Murray-Rust, thank you, because as you correctly point out, it's much more difficult to data-mine stuff out of PDF than it is out of something that has a more XML-like structure.
But I think the way that this is gonna end up getting resolved is that you're gonna be able to put that stuff into the PDFs. Adobe actually has quite an incentive to improve PDF's capabilities in this way, and that's kind of the right answer. A lot of archivists tend to want to preserve the source files and sneer at the result of the source files, but I think that has a lot to do with a sort of pre-web, pre-publication view of the world. The published stuff is actually a lot more suitable for preservation. The issue is it isn't quite as useful for things that you wanna do with the preserved content, like data mining out of it. And I think the better way of resolving that problem is not to try and bang your head against the preservation problems of source files, it's to improve the presentation medium in such a way that it retains the information that you want out of the source files. And the big benefit of doing that is that that information would then be available to the web search engines. So I really think that Adobe has a lot of motivation to go in that direction, and that would make everybody's life a lot easier, and it would keep Peter Murray-Rust quiet. Thank you.

Richard Boulderstone, British Library. I really enjoyed your presentation, particularly around bringing economics into the center of digital preservation, because it clearly plays a big role. Right. I suppose on your final slide about things to worry about, don't we have to, I mean, libraries and archives have never kept everything and have never really tried to keep everything, and shouldn't we be thinking very carefully about what it is we are gonna preserve, and do research and develop models in that area, versus just assuming that we have to create these massive stores of more and more stuff, which actually may be less and less useful as time goes by?

Yeah, that's a very good point. I did mention the need to think through the curation process to, you know, I find myself using jargon, this is terribly embarrassing, to re-engineer the curation process in such a way that it's a lot more cost-effective. I think we need to pursue both aspects. The question is, how can we best use the limited funds we have for preserving the most useful information? And one of them is to get better at choosing which information to devote dollars to, and the other is to reduce the per-byte cost of preserving it. And I think they're both directions that need a lot of work. The important point I'm trying to make is that by focusing a lot on the kinds of things that Jeff was worrying about, at a sort of micro level of preservation, we've driven the costs up to a level that means that we are enormously too expensive for what society needs us to be doing. And at the scale that the British Library operates at, you've got to be very conscious of the fact that there's a lot of stuff that you would like to preserve that you just can't afford, right?

Yeah, I mean, I think we need to seek a balance. I mean, there's the Internet Archive, which feels like a sort of brute-force, save-everything kind of approach, and then highly selective, manual-based approaches on the other hand. We need to try and find some sort of balance or model or approaches to figure out what we can select. And I think that's traditionally been a pretty hard area to solve, and lots of things we've kept by accident, actually. But it's an area that it seems to me we have to spend a lot more time thinking about.

Yeah, I'm trying to agree with you. Thank you.
But clearly the Internet Archive is an incredibly valuable resource, and it's amazing how well they do at the "let's totally ignore collection development and just try and get as much stuff as we possibly can" approach. But it clearly doesn't do a lot of the job that society needs done. It can't get at subscription materials. It doesn't do a good job of preserving open access academic journals, and so on. There are huge holes in it. But I think the reason why I thought it was a compelling example is because, suppose that we're right and it costs $5 a gigabyte a year to do it. Even at that level, it's still too expensive. And we need to do everything we can possibly think of to drive the costs down, so that we get even cheaper than the Internet Archive. And a side effect of doing that would be that the Internet Archive could use those techniques and do a better job than it currently is. But clearly, as I said, I did research, I published a paper based on data that showed that the Internet Archive was losing stuff at an incredible rate. But in a sense, it didn't matter, because the value of the archive is that it's kind of a statistical sample of the web. And losing stuff is just adding noise into this sample. A small amount of additional noise does not degrade the value of the sample very much. And that's a great approach for the Internet Archive, but that's not something that represents any kind of digital preservation that we wanna talk about. And so we need to end up somewhere that's a lot cheaper than the Internet Archive but a lot more reliable. And that's a very, very tough engineering problem. And we also need to get better at the sort of curatorial processes of trying to figure out what's important to save, and going through the legal hoops of getting permission. Well, okay, if you're the British Library or the Library of Congress, you actually don't have to go through those legal hoops, although in practice, I mean-

Oh, yes we do.

In theory, the Library of Congress at least has this magic get-out-of-jail-free card. But in practice, you can't actually use it for anything. So figuring out how to reduce the lawyer-hours-per-byte value of content is also an important part of this whole cost-reduction exercise. So I hope I'm really agreeing with Richard a lot about the need to research all aspects of trying to get the cost of doing digital preservation down to the kind of values that will let us do what society needs us to do to preserve this stuff.

We've got lawyers as well.

Yeah, I'm sure you do. And they're expensive too, right?

Okay, I think we have a break coming up. David will be here. Please join me in thanking him. Thank you.