Good afternoon. I'm David Rosenthal, and I'm here to talk about the really big problem in digital preservation, which is money. This all started, as you'll see, a year ago, and has been developing into quite a conversation in various conferences and blogs. The way I'm going to structure things this afternoon is to talk about the various business models for long-term storage and the likely effect of future technologies on them. This is going to point up the fact that we need economic models of long-term storage to compare costs through time and to make comparisons between different storage technologies. Then I'm going to talk about the economic model that I've been building over the last few months, show you some of the things it can do, and ask what you want an economic model like this to do, because we're in the process of ramping up a significant effort between me, UC Santa Cruz, and SUNY Stony Brook to work on this. So everything you're seeing today is either me arguing about stuff, or very much work in progress.

I'm sure you've all seen this amazing graph, which is Kryder's Law. It graphs the decrease in cost per byte of disk storage over the past 30 years, which has been dropping exponentially. This has led people to think of three basic business models for storing data. There's renting the space; Amazon's S3 is an example of the rental model. There's monetizing the content: if you think about why Google is able to store your Gmail, it's because it sells adverts on accesses to the mail. Think about how often you access your really old mail. Not very often. So how much money is Google making from storing the back content of your mail archive?
It's not making very much, which is the reason why Google limits the amount of mail you can store, and modulates the rate at which it increases that limit, to make sure it makes enough money from your access to your recent mail to continue to pay for your old mail. The remaining model is endowing the data: you deposit the data together with a sum of money which is supposed to be adequate to pay for its preservation forever. Last year at this meeting Serge Goldstein of Princeton explained their DataSpace service, which is actually running this model, and I stole a couple of his slides in which he explained why he thought it worked. What he's really saying is that, provided the cost of storage continues to decrease at the same rate it has historically, if you charge twice the initial cost of storage you can store the data forever. In the question-and-answer session I was somewhat skeptical about this, and discussion continued.

But there's no question that of these three models, endowment is the one we should try to make work if we possibly can. The reason is that rental is exposed to whether the cost of storage continues to drop: if the cost of storage doesn't drop, the rental doesn't drop. Monetizing doesn't really work for preserved content because, let's face it, nobody ever accesses the stuff, or at least nobody you'd be interested in selling ads to. So endowment is the one we need to make work. And for things like NSF data management plans, the ability to roll up the long-term cost of storage into the project funding is kind of essential, because there is no continuing flow of funds into the future that you can count on to pay for the storage. Stored data, unlike paper, doesn't survive benign neglect; it's very vulnerable to interruptions in the money supply. So we should want to make endowment work. But I was a bit skeptical about the idea of charging double the storage cost, and I was right to be skeptical, because actually Serge is charging 30 times the cost of storage.
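To make the double-the-cost argument concrete, here is a minimal sketch of my reconstruction of it, not Princeton's actual calculation: if the cost of storing a given amount of data halves over each media-replacement cycle, the total cost of storing it forever is a geometric series that converges to twice the initial cost.

```python
# Sketch of the DataSpace-style argument (my reconstruction, with an
# assumed 50% cost decline per replacement cycle).
initial_cost = 100.0      # cost of the first cycle, arbitrary units
decline_per_cycle = 0.5   # each cycle costs half the previous one

total, cost = 0.0, initial_cost
for cycle in range(100):  # 100 cycles is effectively "forever"
    total += cost
    cost *= decline_per_cycle

print(total / initial_cost)  # -> ~2.0: charge double, store forever
```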
So what are the assumptions behind the simplistic model that Serge was presenting? Well, the first one is that storage media are the major cost of preservation, so that the stuff that's dropping 20% a year is the major cost. Then there's the assumption that Kryder's Law will continue, at least for a decade or so; after that the exponential has gotten small enough that it doesn't really matter very much whether it continues. And the third is that the service will give you in the future what you paid for in the past. Let's look at each of these.

These are numbers from Vijay Gill at Google, showing that space, power, and cooling are about 60% of the total cost of owning a server in one of Google's farms. This chimes with numbers from the San Diego Supercomputer Center, which has been tracking storage costs over a long period of time; their numbers suggest that media cost is about a third of the total. So if, and only if, all the other costs like space, power, and cooling scale with the number of disks rather than with the size of the data you're storing, then Serge is right and storage media are in effect the major part of preservation costs. But if they don't, then instead of costs dropping 20% a year they're dropping a third of 20% a year, about 7% a year, and the numbers look much worse.

This is another graph I'm sure you've seen: Moore's Law, which has been going for 40 years, not 30, and it looks like it'll keep going forever. But keeping exponentials going forever is kind of hard, and this is what's actually been happening to CPUs. A few years ago Moore's Law continued to deliver more and more transistors per chip, but more transistors per chip stopped delivering faster and faster CPUs, because they ran into all sorts of interesting problems about heat dissipation and distributing the clock around the chip. They also ran into business issues, which were that there were things people thought were worth a lot more money than faster CPUs, such as CPUs that burnt less power. So even if the underlying exponential continues, it may not continue to deliver what it's been delivering in the past.
So you think Kryder's Law is about dollars per byte, but actually it's about the size of a bit on the surface of the disk, and there are a lot of good reasons for believing that Kryder's Law is in trouble. One of them is that the desktop PC market is going away: netbooks and now tablets are destroying the market for desktop PCs, which is destroying the market for 3.5-inch disk drives. You can't fit a 3.5-inch disk drive into a laptop. So there are going to be much smaller volumes for consumer 3.5-inch drives, which are the drives that big storage farms are mostly built out of.

The next one is that the curve has actually stopped. We're just barely getting 3-terabyte drives now; if we'd stayed on the curve we should have had 4-terabyte drives almost a year ago. The reason is that there have been five generations of the technology underlying current disk drives, which is called perpendicular magnetic recording, and on the anticipated industry roadmap we would have switched to one of the successor technologies about now. The successors people are talking about are heat-assisted magnetic recording and bit-patterned media. Unfortunately, the transitions to both of these have turned out to be enormously more difficult and expensive than anybody anticipated, so there's a desperate struggle right now to stretch the existing technology for another generation. To give you some idea how desperate: the way they're proposing to do it is called shingled writes. Shingled writes sound pretty innocuous, but what it actually means is moving the tracks on the disk close enough together that writing one track will affect the tracks alongside it, and then using extremely sophisticated digital signal processing to unmix the tracks when you read them. You wouldn't do this if you weren't desperate.

This is a graph from Dave Anderson at Seagate, one of my favorite people in the storage industry, showing the progress of disk technology towards the line at the top, the estimated superparamagnetic limit. That line is where the bits on the disk can't get any smaller, because the magnetization of grains that small becomes unstable. They don't know exactly where it is, but we're clearly going to run into it sometime between about 2019 and 2026, and in the meantime we have to get onto these various new technologies, which are going to be very expensive.

Then there's the whole question of whether you're actually going to get what you paid for when you pay up front for storage. If you think about it, this is like the insurance industry: you pay up front, and then something happens and you collect money in the future. There are good reasons why the insurance industry is very heavily regulated by governments, because the temptation is to take your money and run, and you basically have no leverage. You've paid the money, they've got your money, and now you claim, because you have some medical condition, and they say, "Sorry, that's not covered." So you need some kind of escrow service which regularly audits the storage service you're paying for, discovers whether it's actually preserving the data, and, if the service stops doing so, transfers the data to a successor.
Well, the problem with that is that with all these storage services it actually costs money to get the data out, and it costs money to put the data into the new service. So you need reserves to cover the transfer of data from one service to another. And then there's the problem that these transfers actually take a long time: if you've got a few hundred gigabytes in one of these storage services, you can't just press a button and instantly transfer it to some other service. So your escrow service discovers that the service is failing and initiates the transfer, and now it's a race between your transfer succeeding and the underlying storage service going out of business.

Okay, so suppose you set up a trust to buy cloud storage forever using S3, and assume a 25% per year decrease in Amazon's rates for storage, which hasn't actually happened, but let's assume it had. If it's using the four-nines service, you would have to deposit $4,700 for every terabyte; if it's using the eleven-nines service, you'd have to deposit $7,000 per terabyte. Well, long-term storage of, say, a petabyte of stuff needs vastly more than eleven nines of reliability, so Amazon isn't charging enough to provide the level of reliability you need for long-term storage. And clearly, since Princeton is charging less than either of these, and Princeton's economics are unlikely to be any better than Amazon's, Princeton isn't charging enough either. What this means is that we have a serious marketing problem, because now you have to go to the people who have the data you want to preserve and say, "You need to give me 70 times as much as the cost of the disk in order to keep your data for the long haul," which is a pretty unconvincing story.
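To show roughly where numbers like $4,700 per terabyte come from, here is a minimal sketch of the endowment-for-rented-storage calculation; the rental price and interest rate are illustrative assumptions of mine, not Amazon's actual pricing:

```python
# Endowment needed now to pay a rental fee that declines at a Kryder
# rate, while the unspent balance earns interest. All parameters are
# illustrative assumptions.
def rental_endowment(annual_cost, kryder_rate, interest_rate, years=100):
    """Sum of future rental fees, discounted back to a deposit today."""
    endowment = 0.0
    for t in range(years):
        cost_t = annual_cost * (1 - kryder_rate) ** t   # year-t rental fee
        endowment += cost_t / (1 + interest_rate) ** t  # discount to today
    return endowment

# e.g. $1,200/TB/year rental, 25%/yr price decline, 2% real interest:
print(rental_endowment(1200.0, 0.25, 0.02))  # ~ $4,500 per terabyte
```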
Okay, so, future technologies. There's a wonderful paper by Mark Kryder, of Kryder's Law, and colleagues at Carnegie Mellon called "After Hard Drives: What Comes Next?", which you've got to admit is a pretty hot topic right now because of flash storage. They assume that Kryder's Law continues at least through 2020, which means that in 2020 you should be able to get a 14-terabyte 2.5-inch drive for $40. Which sounds great, except: do you actually want a 14-terabyte 2.5-inch drive? Think about backing up 14 terabytes of data out of your laptop. How exactly are you going to do that? So what's going to happen is that you're not going to get a 14-terabyte 2.5-inch drive for 40 bucks; you're much more likely to get something like a 6-terabyte 1-inch drive for about 30 bucks.

The other thing they point out, as I showed you on Dave Anderson's graph, is that the curve stops by 2026. So they look at 14 different solid-state storage technologies to see how well they're going to compete with hard drives, and it's pretty depressing, because solid-state storage is all built in semiconductor fabs, and semiconductor fabs take a long time to build. There's a roadmap the industry is working to, which tells you how well it's going to be doing each year out into the future, so you can tell roughly how small a storage cell you can build in each of these technologies in 2020, and therefore how dense you're going to be able to get with them. And it's not very competitive with a 14-terabyte 2.5-inch disk drive.

So will solid-state take over? We can definitely say no, it won't, because for the storage technologies that are going to be in the market in 2020, you have to start building the fabs to make them in the next couple of years, so the people in the companies that build the equipment that goes into those fabs need to be working on it about now. We therefore know roughly how many wafers full of chips will be going through fabs in the years running up to 2020, and we know roughly how much storage you can put on each wafer, and the answer is that you can't build enough wafers to displace the whole market for disk drives. So solid-state memory and hard drives will share the market the way they actually share it now. I look around and there are a lot of iPads and iPhones and things like that, and they don't have hard drives in them, because even though flash memory is more expensive than hard drives, it has other attributes, like not breaking when you drop it, that make it more valuable and worth putting into tablets and phones and digital cameras. So the question is which segments favor which technology, and from our point of view the question is: can you use solid-state memory for long-term storage? Well, it's much more expensive to buy, but it's much cheaper to run, and it has a much longer service life. And as we saw earlier, running costs are about two-thirds of the total cost of long-term storage. So it's arguable that it makes sense to use even flash memory for long-term storage. Ian Adams, Ethan Miller, and I wrote a paper at UC Santa Cruz laying out an architecture for using flash memory for long-term archival storage, and arguing, on the basis of a rather simplistic economic analysis, that it was actually pretty competitive.

But reading the paper after I'd written it, and thinking about it, it was obvious that this level of economic analysis was about as sophisticated as Serge Goldstein's, and it wasn't enough. Everybody can understand that if you have an unlimited number of dollars, it's easy to keep data forever. So you're going to have to make trade-offs between how much you're going to spend and how much reliability you're going to get. How do we make good trade-off decisions?
Because money is critical, making good decisions in this area is important. We need quantitative models of reliability, and I've already written and spoken quite a lot about how hard it is to build quantitative models of how reliable storage is. So the next thing to do was to look at the other half of the equation and ask: can we build cost models? It turns out that these cost models are quite hard too, because storage for the long term is not a one-off decision. The life of the data is much longer than the life of any of the hardware systems you're going to use to store it; the service life of different technologies varies through time; the systems and the media have different purchase costs and different running costs, which also vary through time; and the interest rates you need to pay in order to finance the purchase of the hardware vary through time. So at each stage of the life of some data you have a decision to take, which is fundamentally whether you're going to continue to use the storage that's currently holding it, or whether you're going to replace it with newer technology, and if you're going to replace it, what you're going to replace it with. And you need to take this decision every year or so.

In particular, what I ran into is that I've been working for the Library of Congress on the question of using cloud storage as the storage back end for LOCKSS systems. I'd gotten some versions of what LOCKSS in the cloud means working, and the others should work shortly, but as I started thinking about this it was obvious that the question people were going to ask me when I was done was: does it make economic sense to use cloud storage instead of local storage for my LOCKSS box? And this is an apples-versus-oranges comparison, because local storage has capital costs and running costs, while cloud storage has effectively only running costs, and somehow you need to compare these two on a rational basis.

The rational basis, as you'll find in all the economics textbooks, is something called discounted cash flow, which is a way of comparing costs now with costs in the future. You assume an interest rate, and you invest some money at that rate, so that when you need the money, the principal plus the interest you've accumulated over that time equals the cost you're going to have to pay. The amount you invest now is called the net present value of the future cost. The obvious question is: what interest rate do I use? Well, if you're absolutely certain about the future cost, you can use Treasury bond rates, because every day the US Treasury publishes the yield curve for inflation-protected US Treasury bonds, which connects the term of a loan to the risk-free, inflation-protected interest rate. In practice you're less certain about the future costs than the US Treasury is, so you add a risk premium to the interest rate, which you can adjust to account for how uncertain you really are. This is the standard technique that all investors use to assess the return on future investments, or the future costs that companies are going to incur.
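Here is a minimal sketch of the textbook calculation being described, the single-constant-rate version that the next part of the talk criticizes; the numbers are illustrative:

```python
# Textbook discounted cash flow with one constant rate (illustrative).
def net_present_value(future_cost, years, rate):
    """Amount to invest today at `rate` so that principal plus interest
    covers `future_cost` after `years` years."""
    return future_cost / (1 + rate) ** years

# A $1,000 cost due in 10 years, discounted at a 2% risk-free rate
# plus a 3% risk premium for uncertainty about the cost:
print(net_present_value(1000.0, 10, 0.02 + 0.03))  # ~ $613.91
```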
What could possibly go wrong, right? We know how good these guys are at their job. Well, here is some research from the Bank of England. Andrew Haldane and Richard Davies looked at the history of hundreds and hundreds of companies' stock prices and their actual earnings, and what they found was that investors were systematically using discount rates that were way too high, five to ten percent too high. Another way of looking at this is that investors' horizons were affected by short-termism, and the short-termism was increasing through time. An extra five to ten percent, in the interest rate environment we have at the moment, where the US government can borrow at negative inflation-protected interest rates, is a big number. In effect, what this says is that it's almost impossible to make productive investments pay back. So discounted cash flow doesn't really work in practice.

But actually the news is worse than that, because discounted cash flow doesn't even work in theory. This is work by J. Doyne Farmer of the Santa Fe Institute and John Geanakoplos of Yale. It's complex, and I'm not claiming that I fully understand it; this is my explanation of what's going on. By assuming a single constant interest rate into the future, the computation never sees periods of very low interest rates, like now, or periods of very high interest rates. This would be fine if the outcome were linearly connected to the interest rate, but it's not. What Farmer and Geanakoplos showed was that by changing from a constant interest rate to a bounded random walk in interest rate space, they could change the net present value of some future cost from being very small to being infinite. This is a real problem. What it says is that you can't do discounted cash flow calculations; what you have to do is run a Monte Carlo simulation where you track the path of your system through a suitable distribution in interest rate space.

So that's what we've been building: a simulation environment which includes a number of aspects of long-term storage, and which includes various interest rate simulations as well. We have yield curves, which relate the term of a loan to the inflation-protected interest rate on that loan. We have loans. We have assets; an endowment starts out with an asset, and the asset earns interest while you don't spend it. And we have technologies. Each technology has a purchase cost model; for disks it might be a purchase cost model that follows Kryder's Law. It has a running cost; for cloud storage, for example, this would be the rental cost for the space, plus the bandwidth, plus the compute cost for doing the necessary integrity checks. It has a move-in cost; for cloud storage, this might be the bandwidth cost of uploading the data to get it into the cloud in the first place. It similarly has a move-out cost for when you're done, so that the cost of migrating from one technology to the next is the move-out cost of the old technology plus the move-in cost of the new technology. And it has a service life. If the model chooses to deploy a technology, there's a purchase loan which pays for the purchase and the migration costs; the duration of the loan is the service life of the technology, and the interest rate comes from the interest rate model.
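Here's a hedged sketch in Python of those components as I understand them from the description; the field names and the amortization scheme are my assumptions, not the actual code:

```python
# Sketch of the simulation's core objects (names are assumptions).
from dataclasses import dataclass

@dataclass
class Technology:
    purchase_cost: float  # up-front cost, e.g. Kryder-priced disks
    running_cost: float   # per year: power/space/cooling or cloud rental
    move_in_cost: float   # e.g. bandwidth to upload into a cloud service
    move_out_cost: float  # e.g. bandwidth/fees to extract the data again
    service_life: int     # years before the unit must be replaced

def migration_cost(old: Technology, new: Technology) -> float:
    """Migrating means getting data out of the old technology
    plus getting it into the new one."""
    return old.move_out_cost + new.move_in_cost

@dataclass
class Loan:
    principal: float  # purchase plus migration costs
    rate: float       # from the interest-rate model's yield curve
    term: int         # the technology's service life, in years

    def annual_payment(self) -> float:
        """Standard level-payment amortization (my assumption about
        how the loan is repaid)."""
        r, n = self.rate, self.term
        return self.principal * r / (1 - (1 + r) ** -n)
```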
So how does it work? Well, each year it sets the yield curve. The way I do it at the moment is that it starts at a random year in the past 30-year history of interest rates, and then runs time forwards and backwards, and forwards and backwards, through that 30-year history. Then each year the model generates some new technologies, with different pricing and running costs and so on, and for each technology currently in use it runs the hardware upgrade process to decide whether to upgrade from the old technology to some new technology. It does the accounting for the running costs and the loan costs, and it deals with the fact that you may have borrowed money for five years to buy some technology and then decided after three years to replace it with something else, but you still have to carry on paying the loan for the remaining two years.

This is the hardware upgrade process, and one of the really surprising things so far about this model is that even this is not complicated enough to model real decisions in this area. I was amazed how complicated this was, and I can already see that it doesn't actually model the way people really think about this. You have both a service life and a planning horizon. This is in effect where the Bank of England's short-termism comes in. Suppose you can buy a technology that has a 20-year service life. You're not actually going to commit, at the time you buy, to using that technology for its full 20 years, because who the hell knows what's going to happen in 20 years? You're going to think, "I need this to make economic sense in some time a lot less than 20 years," say seven years, maybe. So there's a planning horizon and there's a service life. If the old technology is at the end of its service life, then you have to replace it irrespective; otherwise you need to compute the cost of keeping the old technology over the shorter of its service life and the planning horizon, and then for each new technology you need to compute the cost of the new technology over the planning horizon. If it's cheaper, then you replace it, and that "cheaper" has to include the migration costs.
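A hedged sketch of that decision rule, building on the Technology sketch above; the parameter names and the per-year normalization for comparing unequal periods are my assumptions about how the comparison is made:

```python
# Upgrade decision sketch (names and normalization are mine).
def should_replace(old_running, old_years_left,
                   new_purchase, new_running,
                   migration, planning_horizon):
    """True if switching to the new technology looks cheaper per year."""
    if old_years_left <= 0:
        return True  # end of service life: replacement is forced

    # Keeping the old unit: its purchase loan is sunk cost either way,
    # so only the running cost matters, over the shorter of remaining
    # service life and planning horizon (per year it's the same).
    keep_per_year = old_running

    # Switching: purchase plus migration (old move-out + new move-in),
    # amortized over the planning horizon, plus the new running cost.
    switch_per_year = ((new_purchase + migration) / planning_horizon
                       + new_running)
    return switch_per_year < keep_per_year

# e.g. old unit costs $300/yr with 2 years left; new one costs $500 up
# front, $150/yr, $100 to migrate, over a 7-year planning horizon:
print(should_replace(300, 2, 500, 150, 100, 7))  # True: ~$236/yr < $300/yr
```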
Okay, so this is where we start showing graphs. This one assumes that Kryder's Law is observed faithfully and storage costs drop exactly 25 percent every year, and the running costs are set so that they represent two-thirds of the total cost of ownership. It uses the Treasury's interest rate database from 1990 to 2010, and the question we're asking is: as we increase the endowment we deposit with the data, what's the probability of the data surviving a hundred years? As you can see, up to about 5.7 times the original cost of the storage, the probability is zero; you're going to run out of money sometime in the hundred years. Once you get past that, the probability rises pretty quickly, and at around 6.5 times it gets to be effectively a hundred percent. What this really tells you is that if storage costs are dropping 25 percent a year, the random variation in interest rates doesn't have a big effect on the size of the endowment you need. That was the result of, I think, ten thousand runs of the simulation.

This is the result of about half a million runs of the simulation, where we're doing that same computation for a whole set of different Kryder's Law rates of decrease, between five percent and forty percent a year. The plateau at the top is where things survive, the skirt at the bottom is where things are pretty much guaranteed not to survive, and the slope in between is where the interest rate is actually affecting things. So we take the 98 percent contour on that, and we get this graph, which shows the rate of decrease of storage costs and the associated endowment you need to get a 98 percent probability of surviving a hundred years. You can see that as the rate at which storage costs drop increases, the endowment you need falls pretty rapidly and then flattens out, which is pretty much what we'd expect. But it's still pretty expensive: we're talking about between 14 times the basic cost of the storage and five times.

This is what I've been working on in the last couple of weeks. I went through all these reasons why Kryder's Law might not continue to work in the medium term, but what actually happened was the historic floods in Thailand, which submerged most of the factories that build disk drives, and disk drive prices doubled in a few weeks. So obviously, if this model was going to be any use, I had to be able to simulate the effect of spikes in the cost of storage. This is modeling what happens if one year the cost of storage doubles, then drops back over a two-year period and resumes its decline. The curve at the front of this graph is the curve from the previous one, showing the effect of no spikes, and the succeeding ones behind it show the effect of a spike one year after you start, two years after you start, three years after you start, and so on. You'll see that if storage costs are dropping rapidly, spikes have very little effect; if they're not dropping rapidly, spikes can have a really big effect. The other thing you notice is that there's kind of a ridge in the surface at four years, and the reason is that the assumption here is that storage has a four-year service life. If you're unlucky enough to have storage costs double exactly at the time when your current storage is obsolete and has to be replaced, you lose; whereas if the spike happens while you still have time left on your current investment in storage, you lose much less.
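To show the shape of these experiments, here is a toy stand-in for the headline computation, sweeping endowment sizes and estimating survival probability by Monte Carlo; the cost and interest dynamics here are drastically simplified placeholders for the real model, so the thresholds it prints will not match the 5.7x figure above:

```python
# Toy Monte Carlo endowment sweep (a stand-in, not the actual model).
import random

def simulate_once(endowment, kryder_rate, years=100):
    """Annual cost declines at kryder_rate; the unspent balance earns
    crudely randomized interest. True if the money never runs out."""
    funds, cost = endowment, 1.0  # costs in units of first-year cost
    for _ in range(years):
        funds *= 1 + random.gauss(0.02, 0.02)  # placeholder rate noise
        funds -= cost
        if funds < 0:
            return False
        cost *= 1 - kryder_rate
    return True

def survival_probability(endowment, kryder_rate, runs=10_000):
    wins = sum(simulate_once(endowment, kryder_rate) for _ in range(runs))
    return wins / runs

for multiple in (3.0, 3.5, 4.0, 4.5):
    print(multiple, survival_probability(multiple, kryder_rate=0.25))
```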
Okay, so, other uses of the model. One of the things we can look at is the effect of short-termism: how long a planning horizon should we really be using, and what's the cost of using a planning horizon that's much shorter than we need? That's an interesting question; I haven't managed to run enough simulations to look at it yet. Then there's the whole question we were looking at with the DAWN architecture, in other words: how much lower do the running costs have to be to make it affordable to pay X times as much for the storage up front in order to get those lower running costs? And there's the local disk versus cloud storage question, comparing something which has purchase costs plus running costs with something which has running costs only, and looking at questions like: okay, so how fast has the cost of cloud storage actually been dropping?

There are a lot of obvious improvements to this simulation. We need better models of interest rates. We need much better models of technology evolution than simply plugging in different numbers for Kryder's Law. We need to do better at investing the endowment than putting it in one-year Treasuries, which is the assumption at the moment. We need a better model of decision-making around upgrading the hardware. And the goal is to implement this thing as a website where people can go and play what-if games for scenario-planning purposes. In order to do this we're putting together a team at UC Santa Cruz and SUNY Stony Brook to work on the problem, and what I'm looking for is feedback. Does this look like it could be useful for you? What other concepts does it need? Obvious ones we have already come up with are replication policies: at the moment we're looking at one unit of storage, but clearly you need to have copies on multiple units of storage, and you should really organize things so that you don't upgrade all the copies of your valuable content at the same time. And then there's the real question that people want answered, which is connecting these cost models to reliability models, and that's going to be really hard. I've left plenty of time for questions, so I turn it over to you.

Well, this is great, thank you very much. As we try to struggle with how much things should cost, this is going to be really helpful to, I think, all universities. A couple of things. You have data growth up there; I was wondering, there must be some kind of increments where once your data exceeds certain sizes, things will jump. Have you been modeling that at all? I mean, like the two-terabyte to three-terabyte jump would probably make a difference. And then secondly, you had a 70x number when you were describing the Princeton case; where did that come from?
Okay, I can explain both of those. The step functions in technology are actually in the model; I haven't been using them so far, because just getting to this point has been quite a lot of work, and I wanted to choose some really simple examples that I could explain to people, so that they could connect what they were seeing in the graph to their expectations. The little I've done of putting step changes into the technology produces some rather strange-looking graphs which I'm still trying to understand, so I haven't done that yet. The 70-times number came from using the assumption that Amazon's storage pricing would decline 20% a year, and looking at what it would cost to fund indefinite preservation by buying the eleven-nines S3 service, compared with buying the four-nines service. That struck me as an interesting number, but the more interesting one, once we get replication policies in there, would be to look at what it would cost then, because obviously you need more replicas at the four-nines service than you do at the eleven-nines service. The reason for the eleven-nines service costing more is that underneath it Amazon has more replicas, right? So there's a question of trying to investigate how Amazon is setting those prices, and it's not at all obvious.

One really valuable thing for something we're working on right now is trying to figure out the storage terms. So if a scientist says, okay, you must keep my data for ten years versus 20 years, your model would be very helpful to help us with some of those issues.

Yes, varying duration is built into the model. Everything I've been doing in this talk uses a hundred years as the target survival, but that's again just so I didn't have to explain too much in one go.

My memory from the discussion last year was that one of the more powerful arguments was that staffing wasn't really accounted for in the Goldstein argument. Did you look at human attention time and institutional staffing?

That's in this model; it's buried in the running costs. Right now the model assumes, in effect, that the running cost is a per-drive cost, not a per-byte cost, so it goes down as the capacity of the drives at constant cost increases. That's probably not a good assumption, but fortunately that's one of the pieces the model is capable of modeling; I just haven't messed with those parameters. One of the reasons why I'm looking at this and saying there's way more work here than I can do is that even in its current rather simplistic form there are a whole lot of parameters to explore the parameter space of, and even with relatively fast machines, half a million runs of the simulation takes a few hours.

Thank you for this; it's really interesting, and I'll keep tracking this as you progress. I was glad you mentioned environmental shocks, and I would just throw out that you also have, it's probably obvious to you, political shocks, local environmental shocks, wars, organizational shocks, that sort of thing. And also, if you're only using the interest rate years from 1990 to 2010, you're missing fiscal shocks like 1982 and that sort of thing; you might want a wider range.
Yes, as I say, we need a better interest rate model. The reason for choosing the past 20 years is that that data is available straight from the Treasury's website, in their inflation-protected yield curve database, which made it really easy. Clearly we need a better model.

I was also glad that you mentioned the possibility of differential retention schedules, and something that might be really useful if this model is used for long-term planning is to factor retention models in: looking at, well, what's the cost if we assume that 99% of this data we're just going to be able to throw out in the next 99 years, and only 1% of it we're really going to need to keep? Maybe there's a 30-year model: that's the life of this researcher, with 10 more years at this institution and 20 years into retirement, and then maybe we turn it off. And I'm being completely serious.

Unfortunately, my impression is that the staff cost involved in figuring out what to throw away is probably higher than the cost of keeping it. But once we get to replication policies, there are some interesting things we could investigate. For example, you could gradually decrease the number of replicas that you keep of stuff through time, because recent data is more heavily used than very old data. On the other hand, clearly there are some things, like databases of observations, for which you don't want to do that. And then again there's this whole business with gene sequencing, where it turns out to be cheaper to re-sequence the genes than to keep the data. These are good questions; I'm actually remembering some of this for notes about what we should be doing.

Hi. I've been wondering when I would first hear the term "digital data storage futures" used; it kind of sounds like that's probably coming.

Yes, and derivatives of futures on data storage, and so on. And yes, I'm sure we can just ignore the fact that the computational basis for these things has just been disproved, and find out how useful they are.

When I first saw your list of assumptions about the cost of storage, it occurred to me to wonder whether you are thinking about what may be a much more complex issue, which is the cost of curation. It sounds to me like all these assumptions are based on a model that only contemplates bit preservation.

Yeah, that's true. Clearly the cost of keeping the data around once it's been ingested is typically something like half the cost of doing the ingestion, even using these fairly large values for how much you need to endow the data with. There are numbers from the Arts and Humanities Data Service, for example, which would suggest somewhere between a half and two-thirds.

I'm not sure I understood what you were just describing.

If you look at the overall cost of digital preservation, ingest is the biggest cost. Essentially, if you assume that storage costs are going to decrease by some amount over time, you end up seeing that even for essentially infinite storage, ingest is typically still the biggest cost. So you've paid more than half the total amount you're going to pay up front, even if you rent the storage forever.

My question has to do with the costs associated with logical reformatting, you know, format preservation. PDF/A is, you know, okay?
Okay. I'm on record as believing that those costs are nil, because those migrations aren't going to happen. I'm sorry, they're not going to happen; that's a whole different discussion. But if you believe in migration, you can fold the cost of migration into the running costs, or into the move-in and move-out costs. That's something the model will cope with. The up-front cost of ingest is not something this model is set up to deal with.

That was the point I was trying to make.

I see. So ongoing curation costs you can fold into the running cost of the storage for the purposes of this, to some extent, hand-waving simulation, but we're not dealing with the up-front cost of getting it in there in the first place.

Thank you. Hello, David. So, have you found any way to test your Monte Carlo simulations against past situations? I have actually written Monte Carlo simulations, a long time ago, and I was always expected to test them against past situations in order to validate them. I wouldn't know where to start here, but I'm wondering if you've thought about it.

"Testing" would be a rather grand way of describing what we've done. There isn't a lot of data to compare it with; the best numbers we have are the San Diego Supercomputer Center numbers, and we've produced some graphs that look something like the San Diego Supercomputer Center's numbers, but that involves a lot of rather hand-waving adjustment of parameters, so I wouldn't put too much faith in them yet. I think this is more valuable at the moment in terms of getting people to think through the issues than in actually making numerical projections. I think it's going to take quite a bit more work before we get to the point where you could actually place some credibility in the specific numbers coming out, rather than in the generic shape of the graph. We need to get to having confidence that the shape of the graphs is plausible before actually trying to move the graphs up and down so that the numbers agree with reality. But I agree; the thing you know about Monte Carlo simulations is that the more parameters you feed into them, the bigger the spread of results you get out of them, and the less meaningful they become, so we need to be careful about that too. So yes, as I say, very much work in progress. Don't place too much credibility in the numbers, because after all, the biggest factor over the initial storage costs that's coming out of the graphs at the moment is 14 times, which is half as much as Princeton's charging. So the numbers clearly need some manipulation. Kevin?

Thanks very much for a really interesting presentation.
I'm struck that you said that even just running a few of these simulations actually takes up quite a lot of resources, and I imagine there are a lot of other people who would be willing to help on this, and indeed effectively to replicate the simulation experiments you've been conducting, or to play with the parameters that you're not interested in; I wonder whether there's any scope for that. I'm also thinking about some of the other concepts people might want to put in there, and coping with uncertainty, allowing different people to have different risk models, strikes me as one of the things that's possible. For instance, at the decision points in the model where a new technology emerges and you have to decide, "Do I want to move to this new technology that looks cheaper, or do I stick with the one that I've got?", I can see that one of the things some people would take into account there is: the new technology looks cheaper, but there's effectively a greater error bar on what its long-term cost will be, and I don't like uncertainty, so I'll stick with the thing that looks more expensive but is more certain. I wonder whether you can model that as well.

Yeah, a "nobody got fired for buying IBM" parameter. Yes, I'm actually talking to Carole Goble's team at Manchester about the possibility of taking this simulation and formulating it as one of their workflows, so that everybody can play with it, and I'm going back there in January to talk to them some more about that. Right now the code is in kind of a shaky state, and people who know my code know that if I say it's in a shaky state, it is in a really shaky state. But I'm hoping that the collaboration with the guys at UC Santa Cruz and Stony Brook will take my code and turn it into something a little more usable in a production sense. We're hoping to have something together sometime late spring, maybe. Stay tuned, because our goal is definitely to take these ideas and get them out there where people can play with them, because, as I say, the parameter space to explore is just enormous, and it's way more than we can do on our own.

It's great news that you're talking to Carole; I hope something comes of that.

Any more questions? Okay, thank you all very much. Really good questions.