[Unintelligible greeting.] Thank you very much for coming to my talk. My name is Jacob T. So, as Amy said, I'm a digital preservation analyst at the National Library of New Zealand; I've been there for five years. What I wanted to talk to you about today is some work we've been doing over the past two or three years which, on the face of it, has not much to do with digital preservation. Really it's about trying to get content into the collections in a sensible, controlled and manageable way. There's a lot of a water theme in my talk, so if you haven't used the facilities and that might upset your sensibilities, perhaps quickly scoot out now. If this works... does this thing work? Hello, where am I pointing at? Maybe just click that. There we go. So I'm going to talk to you about the physical workflow, which is what we did with content up to 2003. Then I'm going to talk about how that changed with the changes to the National Library Act in 2003, and about what an idealised workflow looks like. [Unintelligible passage.] We all have the Internet. We all have these computers, amazingly powerful machines on our desks capable of doing wonderful and great things for us, and it seems like sometimes we plod through the mire. We are not really harnessing the power and capability that the Internet and this kind of streamed resource gives us. So my proposition for this talk really is: we are at the water, we have the water available to us, but can we actually drink it? The water, obviously, being this digital content. [A long passage here is unintelligible; what is recoverable suggests it concerned harvesting the web, how that collecting is scoped, and what is already in the collection.]
[Unintelligible sentence introducing the traditional process: content arrives with us and we check it.] If it's okay, we bring it in, we describe it and we analyse it. There's a big difference between those two, and I picked those words quite carefully, with lots of discussion with colleagues around the library. The describe is the really, really bland stuff: it is a book, and we can tell it's a book, it's got N pages, it's written by an author, all very accessible bits of metadata. The analyse is the way deeper kind of indexing, the conceptual stuff: it's about dreams, it's about things, blah blah blah, whatever. That bit's hard; the describe bit's kind of easy when it comes to the digital stuff. We put it on a shelf and then, ta-da, it's available, it's in the collection. So that's the traditional model, that's what we've worked with, and we still work in the physical space with that model. The digital has an interesting difference, and the primary difference is up here: we now have an obligation to pull instead of having content pushed to us; we have to go and collect. So instead of it being given to us, nicely packaged, by the content creator, we have to go and visit them in the online space and do something meaningful with their content. That it's "made available" to us, which I believe is the wording in the legislation, is all that we really have a mandate for. So we have to go the other way; we have to go and collect the content. We still do the quality assurance piece, we ingest it (which is the nerdy way of saying receive it), we still describe it, we still analyse it. We don't need to shelve it, because obviously the shelves we have now are slightly different, and then of course it's made available. So that's where we are, that's what we're doing, and I guess my argument here is that we could really do some magic with some of these steps; we could automate some of that stuff. We don't really need to be doing it; we could build tools that will go and collect content for us. We can go and touch a website, we can touch an FTP server, we can read RSS feeds, whatever we need to do, and we can do some of that rudimentary checking: I'm expecting 10 issues, I've got 10 issues, tick, move on (there's a rough sketch of that below). We can ingest it automatically: we're very privileged at the National Library to have the Rosetta platform, which has a whole bunch of APIs under the hood, which means we can structure some content, throw some metadata around it, and it disappears magically into the system, and it feels like nobody does any work. Some of that description stuff can happen automatically too. We can look at the metadata of a PDF file; we can pick up keywords; we can pick up publisher-created bits of data that describe our content. (I'm turning into a Kiwi. I said "data". I've been here for five years; it's taken that long. I think that's the first time I've ever done that.) The analysis still has to be done in a human way, because it's still hard, it's conceptually difficult, and perhaps in the future it might be done by these clever robots, but for now that's perfectly okay. That's the really difficult bit; that's why we pay smart people, people that have a really good feeling for this process, to do that job. So let's save the difficult jobs for them and give the dumb, boring, mundane jobs to scripts. So I guess really the question here is: how do we get that magic? What is that magic? What does it do?
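To take one tiny piece of it, a minimal sketch in Python of that rudimentary checking and publisher-metadata pickup might look something like this. The folder name, the expected-issue count and the use of the pypdf library are illustrative assumptions, not a description of the library's actual tooling:

```python
# A minimal sketch of the "expecting 10 issues, got 10 issues, tick" check,
# plus picking up publisher-created metadata from a PDF.
# Folder name, expected count and pypdf are illustrative assumptions.
from pathlib import Path

from pypdf import PdfReader  # assumes pypdf is installed

EXPECTED_ISSUES = 10  # what the title's publishing schedule says we should have

collected = sorted(Path("incoming").glob("*.pdf"))
if len(collected) != EXPECTED_ISSUES:
    raise SystemExit(
        f"Expected {EXPECTED_ISSUES} issues, found {len(collected)}: needs a human"
    )

for issue in collected:
    info = PdfReader(issue).metadata or {}
    # Reuse whatever the publisher already generated rather than re-keying it.
    print(issue.name, info.get("/Title"), info.get("/Keywords"))

print("tick, move on")
```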
My argument is that it's not really magic; it's just about having an organised mechanism for thinking about what we do and how we do it. So I'm going to give you some examples in a wee minute of how we're trying to tackle that. We're trying to use proofs of concept, we're trying to think about digital and how we think about collecting, and about the building blocks we need to put together to let us do this stuff in an automatic way. So we have some basic steps. If we think about that workflow, we have to go and visit and collect; we probably have to walk through some structures; we have to pick up these binary objects; and we want to pick up metadata along the way. If the publisher is very kindly generating a précis, generating keywords, generating a summary, generating all the dates, why don't we just grab that at the same time? Why are we redoing this stuff? We want to package it so we can put it into our system. If I can somehow bundle all that data into a nice structure and give it to Rosetta, then Rosetta knows how to populate its fields. And we also have our catalogue, so again, can I package it in such a way that our catalogue goes, "I know exactly what you're telling me", and in we go? That feels like the right thing to be doing. And then there's an inform stage, because we can't just fire these things off blindly; we need to know what's going on. You can't just assume that it's all happening, because if you do, it suddenly stops happening and then you realise you've got a backlog to address. So we need a mechanism in the middle that says: hey, I did a job for you; this is what I discovered; I've put it over there; and if it went wrong, this is what might have gone wrong, or this is where I got to (a rough sketch of that package-and-inform pattern follows below). Those are the basic steps we got to when we were thinking about what this automagical process should feel like. So this is a whiteboard. I love whiteboards. I am truly a nerd; I have whiteboards in my house. That's how I like to think about the world. This is a snapshot of a meeting where we sat down and talked about: okay, what does this process look like? What are the steps we need to put in place to give us the ability to do something in a clever way? Part of the thing for me is having the space to work in this way, and so, again, I do genuinely feel privileged to work at the National Library and to be tolerated and given the space to think about some of this stuff. Hopefully sometimes it generates benefit. We need resources: you need machines that can do this stuff, you can't do it with a wet piece of string, you need a computer that can talk to the internet and do clever things. You need supportive management. Sometimes you need some real bloody-mindedness; it takes a while to solve some of these problems. And I always come back to one of my favourite little sayings, which is a bit dumb, but in a meta way it works: it's quite okay to have stupid ideas, because if it works, it kind of wasn't stupid. There's a lot of that glue in the things I think we've been making. We can turn that whiteboard stuff into formal projects: we can put the paperwork in, we can give it a name, we can engage resources, we can get devs, we can engage with our IT provider, we can put bits of infrastructure down.
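Coming back to that package-and-inform idea, here is a rough Python sketch of the pattern. The folder-per-object layout, the metadata file, the email addresses and the local mail relay are all illustrative assumptions; Rosetta's real deposit format (METS-based SIPs) is considerably more involved than this:

```python
# A rough sketch of the package-and-inform pattern. Layout, addresses and
# the local mail relay are assumptions; real Rosetta deposits use METS SIPs.
import json
import shutil
import smtplib
from email.message import EmailMessage
from pathlib import Path

def package(binary: Path, metadata: dict, out_dir: Path) -> Path:
    """Bundle one collected object and its metadata into a deposit folder."""
    sip = out_dir / binary.stem
    sip.mkdir(parents=True, exist_ok=True)
    shutil.copy2(binary, sip / binary.name)
    (sip / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return sip

def inform(report: str) -> None:
    """The "hey, I did a job for you" step: never let the robot run silently."""
    msg = EmailMessage()
    msg["Subject"] = "Harvest report"
    msg["To"] = "curator@example.org"    # hypothetical addresses
    msg["From"] = "robot@example.org"
    msg.set_content(report)
    with smtplib.SMTP("localhost") as s:  # assumes a local mail relay
        s.send_message(msg)
```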
We can also make sure that we're figuring out our assumptions and that we understand what they are, and we can make sure we understand dependencies. For me, that's a really interesting part of this problem: if we don't frame these automated tools in the right way, we end up missing some of those dependencies, so when they go wrong, they can go wrong spectacularly, because you think you're doing all this clever stuff and in reality nothing is happening. So those dependencies are super interesting. And there is this trade-off for me between researching the automation versus just getting the job done. If I have 10 objects, it's probably not worth me writing a crawler to go and grab those 10 objects; it's probably much quicker for me just to go and get them manually. So the question for me is: where is that trade-off? How do we get a feeling and a sense that this collection is worth processing in an automagical way, and that one we should just get on with and get over the line? I think there's a nice lesson there for us. So I started to think about the different sets and the different types of collections that we've been addressing, and I'm going to skip you through some different classes of collections we've been working on. One of them is neatly shaped sets. What I mean by that is they are well prescribed; they have an inherent structure which is easy to access; they're very easy to consume: a binary and a piece of well-structured metadata. Arguably, somebody else has done the hard work for you (he says, looking in the DNZ direction). So one of the projects we did was to have a look at a DNZ title that had been digitised and figure out whether we could write a very rudimentary crawler that would go and pick up the DNZ data, turn it into a depositable object, and pop it into Rosetta. It turns out it's pretty straightforward. So we wrote that. We picked up the Public Service Association Journal and its sister title: about a thousand issues, with limited human interaction. It took a couple of hours to write the script and test it, and away it went. About 100 gig of data, and it's kind of done. That was easy; that was a nicely bounded piece of work. The lesson for us there was that one person's clean data is another person's broken software. I make assumptions that the date is going to appear in a certain form, because I use the date to generate the designation and all sorts of other library stuff. When the date is "September and October", my parser goes, "I don't really know what that means", because I just want September or October. So there was a very nice negotiation with DNZ, we talked about how to make those things cleaner, and we cleaned that up. So that's very cool; a bit of minor tweaking and away we went. It was really good, and actually one of the good lessons for me was that really nice, clean feedback loop: having DNZ in the building and being able to just talk to them directly was an incredibly fruitful experience, and I'm sure there's lots more gold in that hill.
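For illustration, the kind of defensive date handling that lesson pushes you towards might look like this; the month-matching rule and the "send it to a human" fallback are my assumptions about how such a parser could behave, not the actual script:

```python
# Illustrative only: defensive handling of issue-date designations, where
# "September and October" must be flagged rather than guessed at.
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def months_in(designation: str) -> tuple[str, ...] | None:
    """Return the month(s) named in a designation string, or None if unparseable."""
    found = tuple(m for m in MONTHS if re.search(rf"\b{m}\b", designation))
    return found or None

print(months_in("September 1947"))              # ('September',)
print(months_in("September and October 1947"))  # two months: needs a rule, not a guess
print(months_in("Spring 1947"))                 # None: route to a human
```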
Differently shaped sets: this is where we've done most of the work. We've been doing this for a wee while, and I'm just going to skip through, I think, four examples very quickly. The first was a test run on a YouTube playlist. Auckland Museum have a channel with a set of playlists, and the question was: can we go and pick that up in a smart way? The answer is yep, no worries, we can do that, it's easy. We got 90 per cent of their traffic; I forget the numbers, but it was a crazy amount of data, and we can pick up the comments if we want to. The structural form is different: it's not a website, but it is the data that drives the YouTube website. That was only possible because YouTube opened their API and allowed us to kind of annoy these things in quite an automatic way (a rough sketch of what that looks like follows below). We had another really fascinating one with NZLII, the New Zealand Legal Information Institute. The short story there is that we picked up 34 gig of data, 77,000 files, which covered off 782 titles. We picked it up in about three days; there was very little human intervention once we got the thing going, and we just have this nice bundle of data. There is a giant problem with that, and I will come to it at the end of this little piece. We have another publisher, Dove Press, who are responsible for about 130 titles. We have done them three years running now. The first year, it took about three weeks to write the script, because it was the very first one we did and it was a little bit like learning to walk. My favourite part of the whole story was when we ran it last time: we just went "let's dust off that script", hit go, and a day later it finished, and it had done exactly what it was supposed to do, so it was quite rewarding. So we have some lessons. The lessons are: we obviously need to talk to content owners. It's a bit rude to go and annoy somebody's API and take all of their stuff, because when things go wrong you phone them up and go, "Hey, your API doesn't like me any more; I think you're blocking me", and they go, "Yeah, we are. What are you doing?", and you go, "I probably should have spoken to you before. Sorry." But once you talk to them, it's fine. Actually that conversation is very good and very fruitful, and it's a good way of having the conversation about requirements and legal deposit and what the National Library does and digital preservation, which is my actual job. The other thing is bottlenecks. Really, sometimes all we're doing is moving the bottleneck on. We know that the bottleneck at the moment is collecting the stuff: we've only got so many people with hands and mice and keyboards and monitors. So if we can move that downstream, the bottleneck now is that some of our collecting colleagues have big giant folders of things to think about ingesting. So it doesn't solve much; all it does is kick the problem slightly further down the road. But that's fine, because we spot the next bottleneck: we start to build ingest mechanisms which allow us to automate that piece, and that moves the problem onto the cataloguing side, and that's somebody else's problem. The idea here is to keep moving the bottleneck by understanding what the flow looks like. And then finally, this conceptual change is really, really interesting. In the traditional space, a journal is a thing: it's bound, it's pages and paper, it has chapters and a title. What we're increasingly finding is that journals have attendant things, so an article may have an embedded YouTube video.
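Going back to the playlist pickup, a minimal sketch of walking a playlist with the YouTube Data API v3 might look like this; the API key and playlist ID are placeholders, and the actual video download and comment pickup are elided:

```python
# A minimal sketch of walking a YouTube playlist via the Data API v3.
# The API key and playlist ID are placeholders, not real credentials.
import requests

API_KEY = "YOUR_API_KEY"    # hypothetical credential
PLAYLIST_ID = "PLxxxxxxxx"  # hypothetical playlist

def playlist_items(playlist_id: str):
    """Yield (video_id, title) for every item in a playlist, page by page."""
    url = "https://www.googleapis.com/youtube/v3/playlistItems"
    params = {"part": "snippet", "playlistId": playlist_id,
              "maxResults": 50, "key": API_KEY}
    while True:
        page = requests.get(url, params=params, timeout=30).json()
        for item in page.get("items", []):
            snippet = item["snippet"]
            yield snippet["resourceId"]["videoId"], snippet["title"]
        token = page.get("nextPageToken")
        if not token:
            break
        params["pageToken"] = token

for video_id, title in playlist_items(PLAYLIST_ID):
    print(video_id, title)
```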
In our kind of logical structure of how we regard intellectual entities, we start to have to wonder how we represent those attendant things in a sensible way. What part of the CMS record do we attach things to? Who owns the IP of these things? If we can't successfully pick something up because of DRM problems, does that mean the article is incomplete, or is it an attendant piece of the object that is incomplete? So that raises some interesting questions for us, and again, I don't think it's a problem; I think it's a genuinely good thing. So that was discrete sets, which are bounded and fixed in time. The next thing we're interested in is stuff which is streamed: that constant drip of content. The very first one we did was with the agency, the DIA, who are again co-sited in our building, and we did this fantastic project with them where they changed their publishing paradigm for the Gazette. The Gazette is a legal instrument; they publish it every day; every day there are between one and 100 notices; they have about 8,000 notices a year; and it's part of the formal construct of the government of New Zealand. They have completely changed their paradigm: they're going away from print, they don't want to do print any more, they want to do just digital. And that causes us some interesting questions around what we do with legal deposit, because we need to be able to collect it, we need to let them fulfil their legal deposit mandate, but we don't want to be a giant burden; we don't want to cause a problem. So we sat down in a room and we just nutted it out, and it was awesome. We sat down and looked at their API, and it turns out they'd done some really, really cool things that made it very easy for us to write a crawler. So we just fire a crawler once a day; it picks up everything they published yesterday, packages it nicely and, in theory, deposits it into Rosetta. So all of that works, apart from the bit that doesn't work, but that's dirty laundry and I should probably leave it over there. In theory it's a beautiful and wonderful thing. There are some things we need to get over: we find that the technology infrastructure is probably where we have the most problems, because as you start to do weird and wonderful things, the corporate space doesn't tend to like what you're doing. But it was great: we influenced the API, they're very happy, they're very interested in that ability to meet their legal deposit mandate, and they get a record. My script emails them every time and says, "Hey, we collected your stuff; this is what we've got", and they can go, "Great, we're comfortable that our content has gone into the library, and that's all we need to worry about." So it's a nice closed loop. This one is the biggie. This one, for me, is genuinely a very, very exciting opportunity. What we're doing in this one, which is a big scaled set, is working with Fairfax Media and their content: we have 21 titles. Through some very, very complex, long and fruitful negotiations and conversations over the years, we have established ourselves inside their print pipeline. So we have an FTP server which is part of their publish routine: they hit go, they send their pages to the printers, and at the same time I get a copy on the FTP.
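Both of these streamed flows reduce to "once a day, fetch yesterday's content and hand it to the packaging step". A rough sketch of the two transports follows; the endpoint, host, credentials and folder naming are all invented for illustration:

```python
# A rough sketch of the daily "pick up yesterday" step for the two transports
# described above: a Gazette-style API pull and a Fairfax-style FTP drop.
# Endpoint, host, credentials and folder naming are invented.
import datetime
from ftplib import FTP
from pathlib import Path

import requests

YESTERDAY = datetime.date.today() - datetime.timedelta(days=1)

def collect_via_api() -> list:
    """Ask a publisher's API for everything published yesterday."""
    resp = requests.get(
        "https://api.publisher.example.org/notices",  # hypothetical endpoint
        params={"published": YESTERDAY.isoformat()},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["notices"]  # between 1 and ~100 notices on a normal day

def collect_via_ftp(dest: Path) -> None:
    """Mirror yesterday's folder from a print-pipeline FTP drop."""
    dest.mkdir(parents=True, exist_ok=True)
    with FTP("ftp.publisher.example.org") as ftp:  # hypothetical drop server
        ftp.login("harvest", "secret")
        ftp.cwd(YESTERDAY.strftime("%Y%m%d"))      # assumes one folder per publish day
        for name in ftp.nlst():
            with open(dest / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
```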
We then just built a bunch of scripts around that drop: they assemble the issues and build the things, so we can pick up yesterday's newspapers, package them, and get them into Rosetta, again relatively automatically. There is very little work that needs to go on there. It's a pilot scheme, so we're just figuring out what it means, and it opens up some quite challenging intellectual questions around the cataloguing and indexing space, which is fabulous; I think it's a very, very good thing. In the six-week trial that we ran, we picked up 12,000... sorry, 1,200 issues and supplements. That was about 31,000 files and about 50 gig of data. And that's constant: that's all day, every day. So if we get that bit right, we're doing something right. We've done a bit of work up front that means we can smooth some of the pain and burden. And again, I watch pretty much every day as my colleagues over in that collections team open the big boxes, get all the newspapers out from yesterday, put them in nice piles, and we do the cataloguing and the microfiching. I'm not suggesting that we're going to replace that immediately; I'm suggesting that there are alternative ways of collecting the same data, and eventually we will tip over. Eventually consumption will be digital, so if we don't start thinking about it now, we're only going to have to think about it tomorrow. This is where I want to be. In my head, my little mini mission is getting us to a place where this is serenely, calmly part of our day to day: we have mechanisms in place, we have tools in place, and it's part of our standard business. We're comfortable with building these robots and automated things that do the churn for us, and the hard bit we leave to the humans, because we've got the smart brains and can do the actual clever thinking. What we want to do is do that in a common direction with content producers and other people and other projects, to convince you that this is a viable way of collecting. And if you're responsible for collecting things, I think there's some really, really nice stuff at the heart of all of that which is going to make some genuine savings in people's time. That was my talk. Thank you very much for being very patient and listening. Any questions? [Unclear exchange.]

[Audience member] You very kindly used the term "analysis" for the point at which, technically, as you're shifting the automated bottlenecks, that's where it's going to end up. And we know about machine analytics and the analytical power of what can be done. Does that mean that the human dimension might, in the future, in your dream, be more of a buffer area for editing or verification than actually manually doing the analysis component?

I think that's an excellent question. If I take off my "I am a National Library employee" hat [unclear aside], personally I think that's absolutely right. I think, at the scale at which we're creating content, we won't be able to keep up with the analysis properly, so we have to put these mechanisms in place. And actually, Douglas's talk will probably cover some mechanisms that give us insight into that (a nice segue there), but I think that's the answer. Otherwise we die. I think we simply have a problem of volume and scale if we don't start to automate and put some shims in the way. Any other questions? Awesome.
Well thank you very much, I'm going to hand over to Amy so thank you very much for your time.