My name is Tom Hutchinson. I'm an engineer with the Tri-College Libraries. That's Bryn Mawr, Haverford, and Swarthmore colleges, three small liberal arts colleges outside of Philadelphia.

Today I want to talk about preservation, the technical aspects. I know preservation is largely a non-technical issue; it's organizational and financial. Still, there are many technology-related decisions, and even if you're following the strictest guidelines, it's not clear how to do everything. The boundaries are also pretty fuzzy: technology decisions involved in preservation quickly bleed into other areas. Repositories certainly come up very quickly, and then also just your technology in general. So I'm going to lay out a few simple ideas, try to motivate them, and then apply those simple principles to some situations and see how far that gets us.

My main underlying theme is just the UNIX philosophy. One big idea they have is: let's solve problems as broadly as possible with very narrow solutions. They're constrained, they do one thing and do it well, and hopefully they're simple, not just simple to use but simply implemented. They're big on good being better than all-encompassing: let's just make something that works pretty well.

We see this narrow, constrained problem in a tool like grep. We're searching text. It does this kind of pattern matching really well, and that's it, nothing else. But we can apply it to any text, so it's not such a bad utility. We also see this more in our world with something like Fedora. We're solving a pretty narrow problem. It doesn't do too much. They're very disciplined about not adding things.
So most people have some other software sitting on top, because we're just solving some simple problems and composing those solutions together. We can see this composing of simple pieces in your regular web app. Most of our apps have a web server like Apache, a database like Postgres, a search engine like Solr. So now we're getting out of the realm of solving some kind of library problem, some kind of preservation problem, and we're just farming it out to everyone who needs to store and retrieve data. That's a big win. We can use these basic building blocks and plug them together, and that works pretty well.

It also lets us reuse our skills. This is how we're set up: we try to line up all our applications so they have a lot of overlap, and we try to keep the application-specific stuff really small. This works out in practice, especially since we're a small team, so reusing solutions works out pretty well.

OK, so another principle is: there's no magic. The salespeople try to tell you there is: we've got a fancy thing. But there's no magic. There's a lot of hype. Fashion is big: oh, did you see this new thing? Let's get into it. There's a lot of excitement around certain products. People can also fall into the trap of thinking there's one super technology that'll solve all our problems. We can fill out the matrix and it checks every box, so it's going to be the super thing; let's just keep adding more stuff to it. In practice, I'd argue that's not always the best way to go.

So now I'm going to have a side rant on the cloud, because what's coming up in a lot of our technology decisions is local hosting versus cloud hosting. We're a consortium.
So we have three different IT departments. One the libraries work with very closely, but the other two are still very present at our colleges. And I hear a lot of different reasons for using the cloud. No one ever gets fired for putting something on Amazon, and it is a good line item to say, oh, I directed our transition to the cloud. That sounds pretty good. And it's not bad stuff.

A lot of cloud usage, and a lot of our technology enthusiasm, is coming out of this big startup culture. We're at the tail end of a big tech bubble, a boom, and there's a lot of enthusiasm. But I cringe every time I hear that a library should be run like a startup. That is a garbage idea. We are so much better than sweatshops, and we have things that they don't have. The forces behind them are just really different. They start with nothing: literally maybe a few people, maybe a little money. They have no code, no staff, nothing, and they want to go from really small to really big. So right from the start they're thinking scalable. They're going to use weird cloud technologies, because in the cloud things are different; there they're thinking about how to scale really well. You don't have a normal, regular database. You don't have a normal, regular file store. Everything's kind of weird, but that's part of the price you pay to be able to scale.
So right out of the gate they're building an app from scratch using scalable technologies, and they're willing to pay the price. It's great for them because they can start with very few resources, so their costs are low in the beginning, and then they want to scale up exponentially for a while.

That's just different. Our needs are much more stable. We want to have extra capacity to take on a big new collection that drops out of the sky; that's always good. We want to be able to spin stuff up and try things out. But we're not scaling up a thousand times and then another thousand. I don't know, maybe you are, but our needs are much more modest.

So for a startup, the cloud works great. These are good technologies for them to be using, and it's a good fit. They really need a lot of dynamism, and they're willing to pay a high price in how they architect their applications and also in how they orchestrate, because they want that scaling to be automatic and dynamic. So they're willing to put in the cost of a lot of tooling too. The cloud is also really good if you're running at the data center level. You're Netflix, and it might make more sense to farm that out to an even bigger player. But that's not me.

So let's go through a recent decision that we're making in the Tri-Co. The principle here is: use what you have. We were looking and evaluating: what do we even have?
Because we were thinking, let's maybe do Amazon or one of those other guys. But let's evaluate what we have internally. Well, we have land and server rooms, power, cooling, redundant high-speed internet, servers, a virtualization stack, storage hardware, a backup system. Kind of a lot of stuff. We also have staffing. We have IT staff, and they keep our infrastructure up. There are the basics, like network availability 24/7. We have a little web development team. It's not exactly what we do, but close enough. We also have a lot of librarians. They're super knowledgeable and excited and want to do stuff. How can we take advantage of all these resources, both the physical infrastructure and the people?

We were looking to set up a new repository and set up some preservation, and initially we thought: cloud, cloud, cloud, cloud. But now we're thinking maybe we should double down on what we have, local infrastructure and local management, and save a bunch of money, but still have the capacity and nimbleness to spin things up and toss storage at problems. We can still be dynamic, up to a point. And we found that, since we already have these very high fixed costs that we're going to pay no matter what, the additional costs are low compared to cloud prices that we had thought were quite low.

Actually, that's another point: we really want to take advantage of the pricing war in the cloud, especially on the storage side. There have been very aggressive price wars. Most recently Oracle has gotten into it, trying to undercut, and you've got Google and Microsoft and Amazon all being very aggressive on pricing and having the deep pockets to do that. So maybe we should take advantage of that. Cheap storage, that sounds pretty good. But we should watch out and have an exit strategy.
The prices are probably going to stop dropping radically, but it wouldn't be so wild if they rose as fast as they fell. Will we be ready then? That's one thing I worry about with cloud stuff: if you don't use it like a dumb commodity and you start using all their fancy tools, which are great technologies, they really get their hooks in you. You're locked into that vendor, and it's very expensive to switch out, technology-wise.

So we're going to expand our local storage, have another copy of our data locally, and have that backed up in our backup system. And then we're going to stick a dark preservation copy of our assets and metadata in the cloud. But for us it made sense to go local heavily.

My vision for the next level of what local infrastructure looks like is not just doing this with the one IT department we work closely with. We have three colleges; we have three IT departments. What if they each had a copy of our data? This actually got a lot of pushback because of the NDSA guidelines on geographic dispersion: our colleges are close together. To me, that's overly rigid. If the greater Philadelphia area is destroyed, I don't know, maybe that's a bigger problem. And we can still hedge by having these cheap dark copies off-site. So that's what I would go for.
Each college has its own copy of our data, and then we have some kind of cheap off-site copy that we don't use that much. Doing local storage is also great because we can do all the access we want. We can do all our preservation activities without thinking about how much it's going to cost to check our files. So that can be a continuous process, for free, which is pretty nice.

OK. Shoot, I had a couple of questions and answers; maybe those are further on. Anyway, one question I had was: where should everything live? We're trying to figure out where preservation should be. We were thinking about preservation storage; we were thinking about maybe the repositories. Where should that stuff go? Let's see what I have.

So, with respect to having it built into the storage: that sounds pretty appealing. We're trying out DuraCloud. It's a good product. There are other ones out there that are pretty nice, some even more featureful. With DuraCloud I also like that it has good integrations with DSpace and Fedora. But I'm not so big on that, because I want storage to be a dumb commodity. I don't really want to care what it is specifically, and I want to be able to move easily across any kind of storage. So when I hear "preservation storage," that sounds expensive, and like it's not letting me choose whatever storage technology I want. It sounds restrictive.

I was also thinking: how about in the repositories? There it kind of makes sense, right?
Well, I don't know. I have a lot of different repositories, and I also have lots of files that aren't in repositories. I do highly support a repository supporting preservation operations. You should be able to say, run the fixity checks on all my assets, and that should be just pressing a button. So I'm not against it going in there. But ideally I would have preservation as its own separate layer that plugs into repositories and plugs into bare files, and that could be one central place to drive those operations and to do organization. This doesn't even have to be a specific technology. It could be a spreadsheet and some scripts or something. DIY gets you pretty far. Look how far we get with hashes and backups. That's not a lot of technology, but we can do at least the not-losing-the-bits part of preservation, if not everything else.

So I like to be able to just plug into our pools of storage and plug into our repositories, but have just one place. One thing I see a lot of places grappling with is that maybe they've been doing preservation for a while, but they haven't necessarily gone through multiple generations of a preservation system. So how do you keep that going through different repositories, for instance? I think keeping it as its own separate thing isn't the worst.

OK, so now this is the rant portion of my talk. Hashtag rant4lib.

One thing I think is: build bridge people. In navigating the preservation waters, I've seen these different camps, the IT folks and the librarians, and oftentimes I find they're talking about the same kinds of things, but they have different languages and different levels of comfort with each other. So just building up these bridge people
who can communicate what we're trying to do is crucial. That's way more important than our specific technologies.

Another little tidbit is: plan for anything failing. We've all thought about, what if a hard drive fails? OK, we've got that covered. Or what if one file gets corrupted? OK, no problem. But think about the other stuff. Your checksum, that's a file too. Are you going to be OK if that goes bad? Will you be able to notice? Do you have some other kind of metadata, a database, or some headers somewhere? Just be on the lookout for any particular component in your ecosystem failing, and ideally plan for a bunch of them failing at the same time, because that's what happens. That's all my rants already, I guess. OK.

Here's a question I've seen come up: Amazon, aren't they doing hashes on hashes on hashes? Isn't that preservation? Can't we just do that, or isn't there a checkbox we can click somewhere? No, absolutely not. That is not preservation. We should pick technologies that prevent errors. I would pick ZFS, I'd pick error-correcting memory, I'd pick S3. But at the end of the day I still think of them as just dumb holders of files, and I think of them as fragile. We want to do everything independently. I want it tangible, something I can hold in my hand. So, checksums built into the file system?
That's good. We should be using those modern file system technologies. But we want to do all our preservation ourselves. It should be independent and verifiable. We want to leave a good paper trail.

OK, another rant: talk to your CS department. A lot of the problems involved in preservation they've been thinking about since before there were physical computers. They thought about these problems just as ideas: how much information can you store? How do you fix errors? Just the simple stuff. Computer science and electrical engineering have a lot of experience here. They call these things coding theory and information theory; Claude Shannon, people like that. These were 1940s, '50s, '60s kinds of problems that were solved and put into practice in software, often right then or within maybe 20 or 30 years. So it's not groundbreaking stuff.

Another opinion: know your materials, just as you would if you were working with a piece of paper and doing conservation. We work with text a lot, so you might want to know a little bit about text. It's not that hard. Take figuring out how to debug the tofu problem: you get that little square tofu character and wonder what happened. Well, we can know about ASCII and those other regional encodings. I recommend knowing about UTF-8. Just those two get you very far. Even just knowing that text encoding exists, not the librarian kind of text encoding but the one at the lower level, even knowing that's a thing is very helpful.

We can also up our game in fixity checks. Checksums, hashes, those are good. That plus backups does a lot. It's not perfect, though. A problem with a checksum is: how can you tell if it's the one that's bad? It doesn't match the file. Which one is the bad one?
I don't know. We look to the backup and compare, hopefully a couple of backups, and we do a consensus. But actually, we can run a big file through a program that spits out a little code, and that code can have properties like: we always know whether the code itself is well formed. So we would be able to detect if the code is wrong.

We can also store redundant information in the code. This is called error correction, and it gives us a kind of self-healing file. This is a big problem: with a physical piece of paper, you tear the corner off and you can still read the rest, but a lot of times with a file you mess up one little part and the whole thing is bad. Well, we can store redundant information; our code can be a little bit big. If we store 10% redundant information, we will be able to recover from 10% disappearing, and if it was more than that we would still be able to detect that it had gone wrong.

I think the codes we really want might not exist yet, but there's very usable software right out of the box today. A bunch of people are talking about Parchive, or par2. Last year Jeff Spies did a talk, and that was one of the technologies he mentioned. You hear it whispered, maybe at PASIG or elsewhere. So this is 20-year-old software and 60-year-old problems. We can use that right out of the box. It's the same kind of technology that's in RAID or ZFS or error-correcting memory, where you have just a little redundant information, so if some of it goes bad, we can self-heal.

All right. I have plenty of other rants and hot air, but I'd like to open things up for questions. Thank you.
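
The self-healing idea from the end of the talk can be sketched with the simplest redundancy scheme there is: one XOR parity block, the approach RAID 5 takes. This is only an illustration, not how par2 works (par2 uses Reed-Solomon codes, which survive more damage). The function names and sample data are made up for the sketch, and it assumes you already know which block went bad, for example from per-block checksums.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def split_blocks(data: bytes, k: int):
    """Split data into k equal-length blocks, zero-padding the last one."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def recover(blocks, parity):
    """Rebuild the single block marked None from the survivors plus parity."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [block for block in blocks if block is not None]
    repaired = list(blocks)
    repaired[missing[0]] = xor_blocks(survivors + [parity])
    return repaired


# Hypothetical sample data: 4 data blocks plus 1 parity block (25% overhead).
original = b"Tri-College Libraries preservation copy"
blocks = split_blocks(original, 4)
parity = xor_blocks(blocks)

# Simulate losing one block, then heal it from the remaining blocks and parity.
damaged = list(blocks)
damaged[2] = None
restored = b"".join(recover(damaged, parity))[:len(original)]
```

One parity block can only replace one lost block, which is why tools built on stronger codes let you choose how much redundancy to store.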