I've got it, [unintelligible], but we shall see. Thank you for this opportunity to talk about my work. As I said a minute ago, if you're sitting at the back you might want to move a bit further forward, because there is some text on the slides. If you're not that bothered, that's fine, but you will probably be squinting at the screen a wee bit. I'm going to talk about some work that we did over the last year or so, I guess; it has taken a long time. It is about some digital files in an old format and how we converted them into HTML [unintelligible]. I'm going to talk to you about what we thought we would do. I'm going to talk a little bit about the set, but that's less relevant, I guess. Then I'll cover what we ended up doing, how we worked with the curator, and a bunch of tools that we made. And then, of course, finally some lessons.
The primary question is: why did we choose WordStar, and what were we actually trying to achieve? What we were trying to do here was deal with content which we would describe as functionally obsolete. WordStar is a very old format. It was around in the 1980s, and WordStar for DOS was quite prevalent. It was a really new and exciting way of writing in those days. There was a WordStar for Windows when Windows 3.11 turned up, but it rapidly stopped being very relevant. A bunch of legal stuff happened and, of course, technology overran it and it disappeared. I can access these files because I'm a nerd and I spend my life looking at this stuff. But if I gave you one of these files, you'd go, "I don't really get it. What do I do with this? It doesn't mean anything. My computer says no." So we know that this stuff is what we would argue is at risk, and we want to do something with it. Secondarily, what we wanted to do was start to test the assertion around what a preservation action is. We talk in the preservation community about doing preservation actions in this very meaningful and upright way: we'll do a preservation action, we'll save the content, and then life goes on. But what we haven't really done as a community is delve into what that actually means, what that feels like, and how we make sure that we're doing the right things to our content. This was a chance for us in the library to explore that, and to make sure we're doing the right thing technically, intellectually and curatorially. So we were doing two things. One was WordStar itself, because it was there and it has a need; the second was that we needed to start somewhere, and this was a relatively low barrier to entry. So at the initial point of attack, we basically said, hey, what do we do?
We'll find all the WordStar files. We'll find a converter on the internet, because there's definitely one that will exist. We'll test it and make sure it works, because we're doing due diligence. We'll convert everything. We'll show the curator how we've done this, how it's cool, how nothing's changed and the objects that we're putting into the system are still the same. Then we'll just shove it all back in the permanent repository, life goes on, and we can go and look at something else cool. That seems pretty straightforward, and it's the way we would expect to do business. Of course, it doesn't really work like that. We ended up having to do quite a lot of work, and it became technically quite an involved process. We were fortunate that some of my colleagues and previous colleagues had some insight and had collected some older kit. This is a laptop that I got to use, which is pretty fun: a Toshiba Satellite T2130. You can see those wicked specs right there. 32 meg of RAM. That is smashing it. That thing doesn't work anymore; the power supply died halfway through this project. I could take it apart and try to fix it, but it served its purpose, and it speaks to the fragility of the mechanisms. We found WordStar 4 on eBay and got it shipped over from the States; I think it cost us something like $3.44 plus shipping, so it probably ended up at about 50 bucks. And then, of course, we found an old printer which just so happened to plug in and work with that laptop. There was nothing magical about that. We just plugged it in and it worked, because some clever people did some clever stuff early on. So that was a way we could go back and test things, because we had basically the original mechanisms that this stuff was created on. This is a scanned photograph of some WordStar content.
This is just a spoof document I made to show the kinds of things that we're dealing with, and there is a pointer on here. The application itself supports bold, italic and underline, and tab characters, which visually are the same as space characters; you can't tell the difference. It understands fonts to a degree: it let me see that this was the original font, and it let me change the font and change it back again. So there's a bunch of instructions inside the format which mean that we can tell you some things about markup and content. Cool. So we know there's a bunch of stuff which is relevant to us. As technical custodians of this content we need to be aware of that, because when I move this stuff into another format, do I need to maintain the boldness? Do I need to maintain the italicisation? I think that's a word. Do we need to maintain the underlining? One would argue probably yes. But we need a mechanism for establishing why it is actually relevant. Do we need to deal with the complexity? Douglas, a question? That's what it calls italics. That's what it said. There are weeds here, and I can talk to you about them at a separate time; you'll see in a minute why that is perhaps relevant. But that is what it said was italic. That's just what happens. We also have a bunch of stuff which is not visible on this slide, which we'll talk about in a minute: things like line wrapping. The format says, "I occupy 80-ish characters on a line, and then I'm going to start a new line for you." As technical custodians we spent a long time arguing, discussing and thinking about the relevance of that line. Is that line a meaningful semantic unit in this textual object? Was it important to the creator of the document that the line was a specific length, so that the line wrapping is something which needs to be preserved?
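To make the line-wrapping question concrete, here is a minimal Python sketch of telling application-inserted line ends apart from typed ones. It rests on assumptions that are not stated in the talk: that a wrapped (soft) line end is stored as a carriage return with the high bit set (0x8D), that a typed (hard) return is a plain 0x0D, and that the high bit on other characters can simply be masked off. The function name `reflow` and the constants are illustrative only and should be checked against real files.

```python
SOFT_CR = 0x8D  # assumed: carriage return with high bit set, inserted by word wrap
HARD_CR = 0x0D  # assumed: carriage return typed by the author
LINE_FEEDS = (0x0A, 0x8A)  # line feeds carry no extra meaning in this sketch


def reflow(data: bytes) -> list:
    """Join soft-wrapped lines and split paragraphs on hard returns only."""
    paragraphs = []
    current = bytearray()
    for byte in data:
        if byte == SOFT_CR:
            # A wrap added by the application: keep the words apart, drop the break.
            current.append(0x20)
        elif byte == HARD_CR:
            # A break the author typed: this ends the paragraph.
            paragraphs.append(current)
            current = bytearray()
        elif byte in LINE_FEEDS:
            continue
        else:
            # Mask the high bit so flagged characters become plain ASCII.
            current.append(byte & 0x7F)
    if current:
        paragraphs.append(current)
    # Collapse runs of spaces left behind by the wrapping.
    return [" ".join(p.decode("ascii").split()) for p in paragraphs]


sample = b"The quick brow\xee\x8d\nfox jumps\r\nSecond paragraph\r\n"
print(reflow(sample))
```

Under these assumptions the two wrapped lines come back as one paragraph and the typed return starts a second one, which is the paragraph-level structure the line-length question is really about.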
And that's a question that we talked about for quite a long time. Page numbering is another thing which appears in the format. It actually appears as a default function inside the program; it has nothing to do with the file object that was in my hand. So what do we do with that? What do we do with those pieces of information? This is what you actually see on the screen, and this raised a whole other series of questions that we needed to look at. This speaks to your point a little bit, Douglas. On screen, this is what you see. This is normal text, and that's what it looked like when it was printed out. This is bold text, and we can see it's bold because it's got these markup characters, and this is what it looked like over here. This is italic text. Now, it doesn't work, perhaps because there's a break in the PostScript information that's pushed over to the printer, and the printer doesn't fully understand what this instance of italic is. So what does that mean to us? Is that important? We do know that on the original object an italic instance has been asked for over here. Whether that's a rendering of an italic object is not really relevant, because we printed that, and it's not an actual artifact that we care that much about. And equally on the underline we have a similar kind of markup. So one of the primary questions was: what is important here? Is it this bit, which is what the creator made when they sat in front of the computer and typed, or is it the resulting textual object that came out the other side? Is that the bit that we're actually interested in preserving? Do we need to preserve this bit, or is it a vehicle to get to this printed instance? We spent a long time talking about that, and one of the decisions we made in this project was that, in this instance, we care about this object; we don't care about that object.
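As a rough illustration of keeping what the markup characters do rather than how they look, the following Python sketch turns toggle-style control bytes into HTML tags. The byte values in `TOGGLES` (0x02 for bold, 0x13 for underline, 0x19 for italic) are assumptions drawn from commonly circulated descriptions of WordStar control codes, not from the talk, and `to_html` is a hypothetical name, not the project's tool. Closing any tag still open at the end of the paragraph is a design choice of this sketch.

```python
import html

# Assumed toggle codes: the same byte switches a style on, then off again.
TOGGLES = {0x02: "b", 0x13: "u", 0x19: "i"}


def to_html(paragraph: bytes) -> str:
    """Convert one paragraph of toggle-marked text into an HTML <p> element."""
    out = []
    open_tags = []  # stack of tag names currently switched on
    for byte in paragraph:
        tag = TOGGLES.get(byte)
        if tag is None:
            # Ordinary character: mask the high bit and escape it for HTML.
            out.append(html.escape(chr(byte & 0x7F)))
        elif tag in open_tags:
            # Switching off: close anything opened after it, then reopen those
            # tags so the HTML stays properly nested.
            reopened = []
            while open_tags[-1] != tag:
                inner = open_tags.pop()
                out.append(f"</{inner}>")
                reopened.append(inner)
            open_tags.pop()
            out.append(f"</{tag}>")
            for inner in reversed(reopened):
                open_tags.append(inner)
                out.append(f"<{inner}>")
        else:
            # Switching on.
            open_tags.append(tag)
            out.append(f"<{tag}>")
    # Anything the author never switched off ends with the paragraph.
    while open_tags:
        out.append(f"</{open_tags.pop()}>")
    return "<p>" + "".join(out) + "</p>"


print(to_html(b"\x13Person A:\x13 hello \x02world"))
```

The control bytes themselves never reach the reader; only their effect survives as tags, which matches the decision described here to care about the application of the instructions rather than their on-screen appearance.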
So we care about the application of those instructions; we don't care that those instructions need to be manifested in a visual way. Does that make sense? So we threw away these tags. We used them behind the scenes, but we said we don't need to show the reader these tags, because they're not actually that relevant. We're lucky that we have an incredibly powerful preservation system in the National Library, and so we used the preservation system to pick up our stuff. Throughout this you'll note there are a few places where I've made some modifications, simply to protect the anonymity of the content that we have; this is all oral history content. We've got 37 of these files. It's not a lot of content, a very small number of files, but that makes it a nice manageable number. It means we can do something meaningful with it, and we're not swamped with volume, which is what rapidly happens going forward in the same oral history project. It turns out that all these WordStar files were written by a library colleague doing a report for the library, which was to transcribe some of the oral histories that we have. So we've got the audio and we have the transcripts, and these are those transcripts. That's nice: it's an internal thing, and we own some of the QA, if you like, on those objects. We used Rosetta to pull the files out; that's relatively straightforward. Then we got to this piece about checking the converters. We thought, it's fine, because some smart person will have done this for us and written the converters, and we'll just test them and go, yep, "apple" still says "apple", underline is still underline. We found eight different converters that worked, and they mainly sucked. They really, really sucked. Of the eight, only two gave me back meaningful digital objects that I could engage with. The other six either failed completely or gave me back broken PDFs or broken RTF
files that didn't open properly or didn't render properly; they just failed. That had us scratching our heads. Is it because our content was malformed? Is this just a function of dealing with stuff which is getting old? I mean, 1986, 30 years ago, this stuff is pretty old. Is it worth the effort of going back to the people making the software and saying, "Hey, your software is a bit rubbish, we've got 39 files to test"? We don't really know. We had to ask lots of questions about where we put our effort and where we put our resource. We decided that no single tool could handle the full set. We're dealing with text, and by the time we'd got to this piece of work we were quite comfortable that we understood the format, so we could probably just do something ourselves in this instance; it would solve our problem, and we wouldn't need to deal with those tools. Is that the best way? No. Is it a sustainable way? Probably not. But it meant that we could keep going with this project, and keep testing the curatorial and internal process parts of our inquiry. So this is what it looked like, and this is the bit where those in the back will be struggling; I apologise. This is what one of those objects looks like if you open it in a normal viewer. You need to convince it to open in the first place, because it's not going to open natively. This is a Linux text renderer, and you start to see all this funny gubbins up here, like weird characters. This is actually the markup that we saw before, the underlining and the bolding. We see line endings, as we talked about, and whether they're meaningful. So intellectually it's all pretty much there. Again, I've made some redactions, so none of the initials and locations are what they actually were; it's just person A, person B, person C, and so on. So you can see that
it's mainly there, but it's not quite there. This bit is really easy to get to: it's pretty trivial to get a relatively readable derivative of the content. But that's not what we saw on the bit of paper. The bit of paper that we started off with, the one that was intended to be printed, didn't have these characters; it didn't have these weird blocks over here. So we know that we need to do something to it to coerce it into some shape. That's basically what we're looking at. The other thing of note, since we talked about whether it's relevant to keep the bolding and the underlining, is that we start to notice patterns. Over here we've got a person, and every time there's a person we've got this funny marker, which tells us that semantically they were using the underline function to denote a person. So there's a semantic inference around the aesthetic componentry that makes up this file, and we know that's useful and meaningful to us. Could we say the same about this? We can certainly say that there's something about page endings, but is it semantically useful? Is the line semantically useful on the page? Is the page numbering semantically useful? I can't scroll down, because this isn't the full document, but there is no page numbering inherent in this document; yet when I print it in WordStar, it prints page numbers at the bottom. Is that a semantically inherent part of the object that we're preserving? Is it applied by the application? Does it belong in the format? Those are the kinds of questions that we were arguing about, and answering them is super important. The lines are not that important, because we know that we can do something meaningful in other formats and let the application define them. There's no semantic inference on the line. What is semantically valuable is probably the paragraph. So as long as we maintain
that structure, and we do hard line breaks properly (this would be a hard line break, as opposed to this being a soft line break), that's okay: we're maintaining the semantic flow of this piece of discourse. So we allowed ourselves that. We also said the page numbering is not important, because we're reflowing this intellectual content into another page-shaped container. The page number has become irrelevant because we're not controlling how many lines we show, since we've already decided that the line width itself is not a semantic function of this document. Does that make sense? So we start to get into this really fuzzy, weird area of what's important and what's valuable. We allowed ourselves to make that decision with this content, but we argue very strongly, and there's a paper I'll point you to at the end, that every single collection needs to be assessed on its content. You can't make these technical assessments without understanding the content itself. In this content we think we're allowed to do that. With other content, if it was poetry, or if there was some kind of artistic licence used in the layout, then the layout would need to be preserved in a slightly different way. So we went with a different solution. It sounds clever: we reverse engineered the format. We basically just looked at it and reframed the tags. We put it into a new format, and we worked really hard on the look and feel; we'll get to that, and I'll show you an example in a minute. The other thing was to give the curator a means of assessing that transformation, to make sure that it made sense, that it was meaningful, and that they understood what they were signing off on. The technical people, the nerdy people in the room, were going, "Yep, we've done all this clever stuff." What we want is for the curators to be saying: I understand exactly what you did, where I need to, and I feel comfortable saying yes, this is an appropriate piece of work and you've
handled it in the right way; you've handled this digital taonga in the way that I expect you to. We get all the sign-offs, and then it goes into the repository and lives forever. We picked HTML because it was super easy. We're basically describing a marked-up format. We could have picked any number of formats, and we had some really long and interesting conversations about what the best format is. It's a textual object: we could have used DOC, DOCX, RTF, or a number of other containers. We chose HTML because it gives us some really easy mechanisms for getting to that markup. It supports bold and underline, it has font management, it allows us to do paragraph indentation, and we can control these things in a really simple way. And we've got heaps of HTML, so if we have a preservation issue with HTML, this is the least of our worries; the rest of the collection that's HTML is what we should be worried about. This is a view into the nerdy land of what we were actually playing with. I'm not going to spend too long on it, but this is what those markers look like. This is the bold marker; it's actually byte code 13. It was pretty straightforward, apart from when there were errors. What we also found was that the application did quite a bit of self-cleaning. If the author didn't close the bold tags, it wouldn't persist with boldness throughout the rest of the document. It would go, "Oh, you've started a new paragraph, you probably don't want bold anymore, I'll switch bold off for you." But that doesn't appear in the document; it appears in the logic of the application. So we spent a long time trying to unpick the errors we found in some of the early transformations, where a document would start with bold and then the whole document was in bold. We had to figure out why there was no closing bold tag. Does that make sense? So we're trying to reverse engineer and really know what it's doing, and we have to
try and use some sensible logic to decode that. This was really the most important and valuable part of the project for me, I think. We created a text object criteria review sheet. I worked with a whole bunch of our colleagues, Peter McKinney put this together, and we used a whole bunch of assessment criteria from heaps of different sources. It basically asks: of all the aesthetic componentry that you might find in a text object, what is important to you as a curator of this content? What do you understand? What do you want us to pay special attention to? And, more importantly, what do we not need to worry about so much? Again, you might not be able to read it from here, but it talks about boldness, diacritics, figures and what have you. For each one we said whether it was important or not important, and we went through, made comments, and discussed it with the curator in really great detail. This is how we got to that bit about whether lines are important or not. So the curators are comfortable that we're not just doing things willy-nilly to the content. They're comfortable that we're touching content in a meaningful way, and that any changes or transformations we make, because we have to, because technology changes, are done in a way that is defensible and can be accounted for over time. This form took, I'm not kidding, hours. We were testing the process, and part of it was making sure that everybody in the room understood the implications of the little decisions that we were making along the way. That's the nerdery; we won't go into it too much, but basically we just used Python, because Python is amazing and this is text stuff, so it was pretty much a win-win. I'm not going to go through that in any detail. And this is what we end up with. Over here is the HTML. This is never shown to the reader unless they view source, but this is the equivalent of that kind of markup
stuff that you saw on screen. The author didn't write it to look like that; somebody writing HTML writes it to look like this. Over here you can see that we've got the underlining, and you'll see they've made a mistake: they didn't underline that piece. It doesn't exist; there is no markup in that section that requires it, so we have to persist with errors as they are. One of my favourite conversations on this topic was about font. When we first did this we used the default browser font. I didn't touch font in any way, shape or form, because the file format doesn't talk about font in any way, shape or form; it just says, "I'm a text character." The curator said, "Times New Roman, though? It's not contemporary." So what font do we want it to be? I don't know, because I've never seen it printed; all I know is the text object that we have. We had this really fabulous discussion, which resulted in us using Courier, because it feels contemporary. One of our justifications was that it was the default in the program: we looked at a number of different instances of the WordStar program, and it always defaults to Courier. For that day, it's a contemporary decision. The default settings in an application, unless we know better, will probably have been used by the author. If they really cared about font, and we did some testing around font, which I showed you earlier, we would have seen some clear indicators that they'd used a different font. So that's what we ended up with: this is HTML, and this is that stuff moved 30 years into the future. We also built another tool, which I'm not going to talk about too much because I don't really have the time. We wanted to give the curator a method of testing the veracity of what we'd done. This tool strips away all of the markup and takes the original and the converted version, just the words without any of the markup. It was designed to break the moment you have a discrepancy; so this says "red" and that says "read". The
moment there's an intellectual discrepancy in the text content, it would break and tell you there's a problem here. We gave that to the curator with all of the converted materials and said, this is your opportunity to go through it. Behind the scenes we do all of this logging anyway, because we're nerds and we do that kind of stuff, so we know that everything that comes out is perfect, and if it's not perfect we know why. But that's not my decision to make; it's for the content owner, the content curator, to make that decision. I needed a vehicle to allow the curator to make those meaningful decisions, and not just trust me that I'd done a good job, so that they could go through and wade through this content in a structured way. It turns out this tool was super useful. So, in wrapping up, the lessons that we picked up were these. You need to know your content, and I really do mean know your content: technically, aesthetically and informationally. You need to know what it's doing. What's the preservation intent of this object? What is it doing for me? What's the technical gubbins under the hood? We know what every single byte in these objects does, which is awesome, but it's nerdy, and not many people really need to get down to that level. So we need to make tools which bridge those two divides: tools that shield the people who care about the aesthetic and the information from the horrible technical bits, and that allow the technical people to do their job and the curatorial people to deal with the aesthetic and informational part, so that we all come together and agree on what we need to do. Trust in processes was another really interesting one. We naively thought we could just pick up these converters and deploy them, and they would work. It turns out they don't work, and we don't really know what they're doing, so how do we trust what they're doing? They're black boxes: you have an input, you have an
expected output, but if you don't know what's happening inside, how can you be sure that it's not really messing with this stuff? There was a situation with some of the first HTML we produced. I said, this conversion tool can make HTML, so how about I just squirt in another bit of HTML which changes the font? That was heresy at the time: the tool did it, so why would you want to touch something a tool made? And I said, but we know what we're doing as technical people; we're not adding any voodoo, we're adding another bit of knowledge. But the tool was trusted more than it probably ought to have been. So that is the lesson about trust in processes. And then the final thing, which is standard in the tech space: if you make things, make them reusable, because they'll pay off quickly. With all of the IP that we made for this project, we've now got two further projects where we're just waiting for a start date, to see how quickly we can do it. This was an exercise in seeing what we needed to do. Now we know what the process is. We've got WordStar 2000, we've got a bunch of those files, and we've got some electronic typewriter material as well. All of these processes and tools are identical; all that's different is that nerdy conversion of the weird markup. That's irrelevant to the curator; what's relevant is that they get the aesthetic object and the informational object that they thought they had. So that was my talk. Thank you very much for your time. Any questions? That is a great question, and I'm not going to give you an answer that you're going to like or think is useful. It really depends on what you're measuring. In terms of actual work time, I would argue it takes roughly a person-week to do the technical analysis, to make sure we understand the format fully. Again, let's be very clear: this is a relatively trivial text format, in a very simple space. To do a converter,
we're using the IP we've already made, so I reckon it'll take us about a week for each of the two new formats that we have. This one actually took us the best part of a year, on and off, because we really sat down and chewed on the problem. We wanted to take it slow; we wanted to bring the curatorial staff along with the technical staff on a journey, so we purposefully did it in a laborious way. So you can do it a hell of a lot quicker. The honest answer is, I don't know: we've never really done it as a let's-do-it-properly, start to finish, get-it-done job. But for the technical bits, I've estimated about a week for each of the two new formats. Any other questions? Yes, yes, my life would be so much easier, yes. That is an excellent question, and at that point I would refer you to your collections policy people. I would say that's a decision for the curator. If the curator says, "I'm comfortable that the print matter is all that matters in this instance, and the digital part is not that relevant," then that's their decision. What I'm saying is, that's not on my desk; I don't know. Any other questions? I would hope so. I really, really hope so. I'm not really a programmer, so everything I build is a bit rubbish, and I would feel very nervous about putting it in the public space. Anybody can ask me for it and I will give it to them. What I don't want to do is let it go without plenty of words of caution around using it. The next time we do this we'll refine things and make it better, and so on, and if other people want to use the same stuff, we'll make it better. That's entirely the model that we need to have. It's not sustainable to do it in this way; that's primary lesson number one. So it belongs out in the open, and we want to make tools which are simple. The reason we wrote the tester was exactly that: I could have shown the logs to the curators, but they would have just got frustrated; they're not
easy to use. So I really think it's definitely a good thing, and something we should be doing, and the more people that ask us to do this stuff, the more likely it is to happen, I guess. Awesome. I've probably taken more of the time than I ought to, I know. One more question? "Just speaking as one of the policy people, I'm wondering about sharing IP. Can we talk about this offline?" Sure, absolutely, I'd love to. No worries. Thank you.
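For readers who want a feel for the words-only comparison tool described in the talk, here is a minimal Python sketch. It is not the library's tool: the function names are hypothetical, and it assumes that the original's markup consists of non-printing control bytes and a high bit that can be masked off. It reduces the original and the converted HTML to bare word lists and reports the first position where they disagree.

```python
from html.parser import HTMLParser
from typing import Optional


class _TextOnly(HTMLParser):
    """Collect character data and ignore every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        # Keep words in neighbouring paragraphs from running together.
        if tag == "p":
            self.chunks.append(" ")


def words_from_html(markup: str) -> list:
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).split()


def words_from_original(data: bytes) -> list:
    chars = []
    for byte in data:
        ch = chr(byte & 0x7F)  # assumed: the high bit is a flag, not content
        if ch.isspace():
            chars.append(" ")
        elif ch.isprintable():
            chars.append(ch)  # control bytes (the markup) are dropped
    return "".join(chars).split()


def first_difference(original: list, converted: list) -> Optional[int]:
    """Return the index of the first differing word, or None if identical."""
    for index, (a, b) in enumerate(zip(original, converted)):
        if a != b:
            return index
    if len(original) != len(converted):
        return min(len(original), len(converted))
    return None


source = b"\x13Person A:\x13 I read the book\x8d\nyesterday\r\n"
good = "<p><u>Person A:</u> I read the book yesterday</p>"
bad = "<p><u>Person A:</u> I red the book yesterday</p>"
print(first_difference(words_from_original(source), words_from_html(good)))
print(first_difference(words_from_original(source), words_from_html(bad)))
```

In this sketch the first comparison finds no difference and the second stops at the "red" versus "read" word, which is the kind of discrepancy the talk says the tool was designed to surface for the curator.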