So yes, I am Greg Rolston. I am at the National Library. I've been here nine or eight years, I forget, it's ages. And you can tell you've been at the library too long when you get back into digital painting and you paint a picture of Turnbull himself. I call that one Horsburgh, because that's his middle name and it's weird. But anyway, my background is in animation and illustration, which I studied at uni, which is just to tell you that I have no business doing what I'm doing, I guess. So you'll see as I go along. So anyway, what this is all about is, you know, Brewster made the call last year about doing all the books in New Zealand. That was something that the digitization team at the National Library, so I'm in the bulk digitization team that does Papers Past and maps and all that kind of good stuff, I mean, that was something that we've been kind of rattling around with for the longest time. And the gist of what it's about is, this is a scoping exercise: if you want to do all the books, how many books is all the books? I don't really know just yet, but we'll get to that. So the questions are: how many books are published in, by or about New Zealand? What is already available digitally? How much is there left to digitize? And, with me newly being opened up to the world of copyright, what can we digitize? Because that is quite possibly the most important one, because we're not supposed to do anything illegal at DIA. It's actually in the code of conduct. I know it's common sense, but you know, I'd like to stick with that. So you know, we only want to be digitizing stuff we can, and we don't want to get into any of that other stuff, because who's got time for that, right? So I was very fortunate last year that I got to go talk at PyCon here in Wellington.
PyCon's a Python programming conference, where I gave a talk on where I was at that point in time. So it was really great, because I had these snapshots. So I'll give you a rundown of what I was going through a year ago, and then you can see what I'm up to now. And man, I've learned so much, it's mad. So without any ado, Law & Order noise. So I had a sound effect, but I forgot to put it in. So the first thing was, everyone loves this slide. It really has no purpose other than being badass. But yeah, ride the light and it's got reach for you 200 times. So that's how you know you've gelled with your audience, is when they're going nuts over some stupid joke you made, went down better than when I belated them, which went down better than that. So anyway, this is how I got into this. So when I first started in the digitization team, there was this bit of work done by a consultant from a consultant company who had been given a whole lot of Publications New Zealand data, and he gave us this ridiculous spreadsheet filled with pivot tables and stuff. And it wasn't ridiculous in the way he'd done it. It was just how big it was. And at the time we didn't have a modern version of Excel. So I was having to deal with it 60,000 lines at a time, which is uncool. So that was kind of my introduction into it. I kind of put my hand up as, oh, I'm pretty good at technology, I'll work it out. And I did, and wrote up a bit of a report on it and what he'd found and all that kind of good stuff. And then I had this kind of genius idea of, what if we matched up Publications NZ with the HathiTrust stuff? I mean, they do the open data thing pretty well, and books are pretty rad and everyone loves that. So that was kind of the starting point for this.
And I love this painting and I love this story, because I feel so much like this woman who, I forget her name, as it's like, I'm guessing by the laughter that you know where this is going. But this is a fresco in a church, it's like 150 years old. This really nice old lady goes, you know what, it could do with a bit of a spruce up. And that's what she did. And so that's kind of how I feel sometimes with this stuff, is I kind of have no business doing this. You know, I'm not a librarian, I work at a library. You know, I'm not a programmer, but I like to program stuff. And that kind of thing, but I feel I'm getting better results than that. But I mean, that kind of in a way was successful. I mean, like, it's kind of insane looking, but how much did the world latch onto the story, right? Like that church has probably got way more visitors than it ever did. So it can't be all bad. And apparently you can reverse it, but you know, I would almost argue, why would you want to? So this was my experience. Because when I first started, I was like, well, you know, I'll just do some sweet VLOOKUPs. You know, like Excel is pretty rad. You know, like I've got the newer version now, you know, I can do more than 60,000 lines. I can VLOOKUP like the best of them. But you know what? Excel doesn't like looking at 400,000 books at once, with many, many fields and many variations on different things. So it was my good buddy Jay, who does have a thing for horses, I mean, come on, who said, hey, why don't you, like, look into this Python thing? And then I explained, and then I was working away and he got me into it. And you know, I then started teaching myself. And then one day I was explaining this mad scheme I had, where I was going to save a JSON file to Dropbox and then, like, upload it to a server. And he's like, just use a database. You've just described what a database actually does. You should probably use that. So, you know, I was like, all right, fair call.
Fair call. I'll learn databases. And then, you know, I was struggling with it, and I was trying to use this data set that this guy had made, that was his scoping exercise for trying to find out the books. And I realized I was trying to do the wrong thing with the wrong data, or the right thing, whatever. It wasn't working. It was frustrating me. You know, it was missing fields. There were short titles instead of long titles. I couldn't tell what mistakes had been made, what mistakes had been introduced in his process or whatever. And I had a similar experience when I got given a whole lot of data from OCLC, from Lorcan Dempsey, who did a tour around NZ a few years ago and did some talks. And he had a data set that they gave me very kindly. And I was having similar problems. It was like 60% of the top 100 of that were movies, and that's not of any use to me. So that was when I said, you know what? I gotta get to know my data. Gotta take it out for dinner. Gotta get to know what it's into. Ooh, just call that one early. So, you know, the data's into turtles. Baby's murdered here. Stay away. Anyway, so that takes me on to my next thing, which is that the best moments of learning I've had have been at hackathons. So last year, Jake Tuso, my good friend with the horse, organized this hackathon, and I did some sweet graphics and got far too much credit for what I actually did. And, you know, it was hanging out with these smart people, going to PyCon and learning from them. And it's this really weird thing where you don't know what you don't know until you don't know you don't know it, which is like a horrible thing to say. And if someone has a better way of saying that, please let me know. But it's like just being around these people, like they were going on about Elasticsearch and stuff, and I was just like, I have no idea what this is, but it sounds awesome.
And just sort of this, of most of this thing of being around these smart people, you know, there's guys from Catalyst and all sorts of places, super smart people, and they're all super friendly and none of them are toothbrushes in their mouth and we're brushing their teeth, so that was good. Oh, we're in roller blades, so I can't complain with that. Right. That was where I was at a year ago. Now the talk proper starts. Probably ate too much time there, whatevs. So DIY, it's in our DNA. Everyone remembers this ad, it's one of the kids in the sandpit where they're giving the Aussies grief. And, you know, it's corny, marketing stuff, but you know what, there's a lot of sense in it. You know, like the New Zealanders have this love of sheds and doing stuff themselves and you know, just I'll come over and fix it and we should have some beers and a barbie afterwards and that's payment enough kind of thing. And you know, that's what, I guess I'm kind of trying to embrace with some of this stuff is, you know, I've got into my own DIY and working, you know, I've got my electronics bench, you've got my power tools and my measuring things for some paint for making stuff look pretty sweet afterwards. And, you know, what it turns out that programming and digital stuff, it's not really that different, right? Like I've got Python, which is my, I mean, I put it on the electronics bench, really to be honest, it's like duct tape glue, sort of, you've failed every, you've done and made a horrible mess with six different databases. Python them all together and it'll probably work. I don't know, it might be slow, but you know, it'll work, it's awesome. You know, MongoDB and Elasticsearch are both kind of variations, I guess, on the database, one's an indexing thing, one's a document store. And that to me is much like a rotary tool or a Dremel, if you're into brands, or like a router or a drill, you know? 
They both spin and cut stuff, they just do it for different things, you use them for different reasons. And that's kind of, you know, it's just like getting to know your tools, you know? I bought a rotary tool, it's changed my life, and the workshop was great. And then I like to think of measuring tools and stuff as Sublime Text, that's the logo for that, that's my favorite text editor, worth the 70 bucks or whatever it was. The only non-open-source thing there, I guess. But you can use it for free, it'll just give you a bit of stick every 100 times you save. But you know, that's where you lay out all your stuff. You know, you lay out your scripts, you measure stuff up, you do tests and stuff, and then once I'm ready, I bang it up into a web application-y thing and then it spits it out in Chrome and does some pretty sweet stuff. So on the left, you've got my DIY output. So I've built my arcade machine. It runs on MAME, plays all the Street Fighters and all the goodness and pretty much anything. There's an arcade stick built from a super gun. And on the right, you've got some web applications I've built. So the first one was my attempt at visualizing the OCLC data. That's just a static pie graph that I made using a Python graphing library that I'm blanking on the name of. And the bottom one was my first attempt at using dc.js, which is a combination of Crossfilter and D3, and it is so cool. It's my favorite thing. And I learned about that at that hackathon as well. This guy came up and was like, oh, I did this, it's so easy. And then I was like, look, there's coders. I know that's not easy, you're lying. But actually, you know what? It is quite easy. He was right. I'll give him credit. So this is now where I get to the part where I'm talking about what I'm actually doing, I guess. So I've got three buckets of content. I've got the Publications NZ data set, which comes in my favorite format of all time, MARC. I'm lying.
It's not my favorite format. I've then been looking for OCLC numbers in there and reaching out to the WorldCat service. And they've got some sweet linked data. So I've been grabbing the JSON linked data to process and find more linkages and do good things. And then I also downloaded 6.5 million files from the HathiTrust, and they're tab-delimited files. They call them the Hathi files. They're actually just CSV files with tabs, but whatever. And I've done stuff with them. So what I've got is a stack of Mongo databases, which is actually one Mongo database, but you have collections in it, but it looks cool to make it a stack. So I chucked my Publications NZ stuff in there, and stored all my WorldCat documents. Then I spun up Elasticsearch and chucked all the 6.5 million Hathi files in there. And Elasticsearch was a real game changer for me. I mean, when I first started, I was writing scripts that were gonna take like three weeks to run. And that was if my computer didn't decide to reboot overnight or whatever, and then I'd lose my progress because I was doing it horribly. But with Elasticsearch it took me four hours to ingest all 6.5 million Hathi files. And it took me four minutes to find the matches in Pubs NZ. That's kind of amazing. That blew my mind. And of course, the duct tape here is Python. So this is how I put data into databases, then index it in Elasticsearch, and then I use a sweet module called Flask, which is what they call a minimal web applicationy front-endy thing. Builds pages, it's great. And also it gives you, like out of the box, kind of like API stuff. So you can kind of build your own APIs based on your databases real easy. And then to tie it off, I've got my final pipe of Python spitting through JSON in an API form into a web browser, and I use a few different JavaScript libraries to make everything look all pretty and beautiful and do cool things. I'm not very good with JavaScript, I'm getting better though.
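As a rough illustration of the matching step described above (tab-delimited Hathi files on one side, Publications NZ records with OCLC numbers on the other), here is a minimal sketch. It swaps Elasticsearch for a plain in-memory dictionary, and the three-column file layout, the sample rows, the record ids and the function names are all made up for the example; the real Hathi files have many more columns.

```python
import csv
import io

# Hypothetical, simplified layout: volume id, OCLC numbers (comma-separated), title.
# This is NOT the real Hathi file layout, just a stand-in for the example.
SAMPLE_HATHI_TSV = (
    "vol.001\t12345\tA History of Wellington\n"
    "vol.002\t67890,11111\tBirds of New Zealand\n"
    "vol.003\t\tUntitled pamphlet\n"
)


def index_by_oclc(tsv_text):
    """Map each OCLC number to the list of volume ids that carry it."""
    index = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for volume_id, oclc_field, _title in reader:
        # A volume can list several OCLC numbers, or none at all.
        for oclc in filter(None, (n.strip() for n in oclc_field.split(","))):
            index.setdefault(oclc, []).append(volume_id)
    return index


def match_records(pubs_records, oclc_index):
    """Return {record id: [volume ids]} for records whose OCLC number is indexed."""
    matches = {}
    for record in pubs_records:
        volumes = oclc_index.get(record.get("oclc"))
        if volumes:
            matches[record["id"]] = volumes
    return matches


# Made-up stand-ins for Publications NZ records.
pubs_nz_sample = [
    {"id": "pnz-1", "oclc": "11111"},
    {"id": "pnz-2", "oclc": "99999"},
    {"id": "pnz-3", "oclc": None},
]

hathi_index = index_by_oclc(SAMPLE_HATHI_TSV)
matches = match_records(pubs_nz_sample, hathi_index)
print(matches)
```

Building the lookup once and then probing it per record is the same shape as indexing first and querying second; at 6.5 million rows a search index does that job far better than a script holding everything in memory.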
But so now let's talk about how I've done this, right? So that's what I've done is I've made this crazy web application stitched together with my madness. And I wouldn't have been able to do it without my sandbox. So my sandbox is this laptop right here. This is a laptop that I fought long and hard to get off the network, off the corporate network. They gave it to me on the corporate network. I went away to a conference and had to install a car driver thing and that was the perfect excuse for me then complaining and getting a rebuilt non-corporate build, which now means I can install databases and what have yous and make do Python madness and take it home and spend all weekend harvesting things because it's my role. And I wouldn't be able to do that without the sandbox thing because it's like, I mean people are scared of technology, right? But often it's the scared of breaking it but you know, I kind of like breaking things. And but it's like this controlled breaking, right? Like you want to be able to break it and know you're not going to ruin everyone else's life around you. That's the best kind of breaking, just slightly inconvenient. I mean, I rebuilt this, I rebuilt the data set that my thing's working on now on Saturday morning. It took four hours of on and off scripting but you know, that's the kind of thing when you've got, when you can just do that, you can just rip it down, tear it back up, try something else, try something faster, try something slower. Oh, it's taking 10 hours. I'll just kill it now, start again, do it the old way, whatevs. So I just like this, this is Rad. This is a Dune reference. I was trying to find some kind of thing like fear is the mind killer, new fear or whatever but they were all kind of like trees with sunsets and stuff. So Dune speaks to me much more. But yeah, fear is the mind killer, you know? Like I learned about elastic search like a year ago but I was too scared to try it. 
So I tried it like two months ago, well not even that and it like changed my life. Like it was amazing, oh my God, where have you been? This is amazing. And it's like, I knew about it but I was too scared to kind of, I just like, I don't have the time, I don't have the energy. Oh, I'm already using Mongo, Mongo is good enough but you know, they do different things and you know, the things that, the speed that I get out of it has made my life so much easier and all it took was me to watch a webinar, do a Python tutorial, which was in earlier tutorials, more like some example code and I was rocking elastic search and it's amazing. So, you know, don't be scared, just like, you know, embrace it. Like learn to love your computer, you know? They're not scary, that guy's pretty scared but you know, hug your computer, you know, get into it. Like, you know, I mean, I always remember growing up, my mom was always asking me to do stupid things like can you unplug and plug in the VCR because I don't know how. And to me, when I think back on that, I'm like, it's a yellow cable into a yellow thing, a red one into a white or white into it. Like, how do you get that wrong? Like, why are you so scared to try the VCR? You know, it's color coded, man. And I just think that's just something that I've always had is this, I'm not willing, you know, if someone gives me a camera, I'll play with the camera and give it a shot and then it's only after I can't get it working that I'll normally ask for help but I would never ask for help first. I don't know, maybe it's just because I'm scared of people. So, anyway, the way, the sort of weird model that I've come up with, and I don't even know what it's with, it's probably got a name, Jay was like, oh, you should really find out the name for this, but I don't have time for that. So, when I was in image services, we used to do a lot of digitization, you know, we did this sort of boutiquey, digitizing stuff real beautifully and awesome. 
And so this is a Whites Aviation film neg that was probably done in Pictures Online. I can't remember. Should have written down the reference number, bad library person. But so the way we store pictures in the library, and how I've always been taught to do it from a digitization point of view, is you start off with your preservation master, your raw scan. So this is a 16 bit, so it wasn't, no, it is definitely a 16 bit, it was not done in Pictures Online, I take that back. They did not make, I'm rambling. So this is a 16 bit negative, straight off the camera, as it's been scanned. You know, we save that as our preservation master, it's 16 bits, so it's got way more information than we can even see on a computer monitor. It's like 120 megs, it's kind of ridiculous. But it means that we can then generate a modified master, which is the version that 99% of people will use, and the only reason you would ever go back to that PM is if you did something horrible to your MM, like I did one time where I got Newton rings all over them. I won't go into what Newton rings are, but they make weird patterns on stuff. I then had to regenerate MM files from like 1,000 PMs over the space of a few months and write a whole lot of paperwork to change it. But because we've done that PM/MM flow thing, it means I could do it. So that's kind of an attitude I've taken with the data. So this is what I call the preservation master of my data, it's a very sexy MARC file in its beautiful text editor view. So you've got all those weird and wonderful characters that I don't even know what they are. I have no idea what those are, US or whatever, some kind of divider things, that's like the end of a line, but pymarc does that for you, by the way. Python, it splits it for you, anyway, it's good, it's great. Made by Ed Summers, he's a good guy. And so this is my modified master of my data.
So you can see I had the madness of the MARC, and then I've gone, okay, well, I need an author, I need a date, need government, NUC, publisher, subjects. And I've simplified it even further than the last time I rebuilt it, because I decided that I don't want to separate them out, I don't care if a subject's an a, z, x or whatever, I just want subjects. So I made them a list instead of a dictionary, it's good times. And so because this is a version that I've made and I've still got my PM, I can change it. You can see it says Wellington there, and it's not covered in weird square brackets or what have yous. The way I get rid of square brackets is I zero them. And the way I get around that 1965, 60 thing, I just zero it. So if you don't know, it's better sometimes, when you're dealing with massive amounts of data, to just go, I just don't know, rather than to try and force something that might actually skew your data and do something weird. So that's how I answer that. So the thing is, you're gonna wanna groom your data. Like this guy, you're probably gonna go too far at times. He won an award though, I should actually mention, that beard won an award for best beard or, I don't know man, those people are weird. But you will be doing this a lot. So like I said, I regenerated my final data set for this thing on Saturday morning, and I must have done it hundreds, well not hundreds, tens of times before. You get faster and better each time you do it, you learn something new, you pull in some more data, you know, you add your government, you add your language, you do all sorts of cool and wonderful things. And yeah, you just get better and faster at it each time. So this is, I guess, the part where I talk about looking at data. So that's a screen grab of the Mongo console, so that's the sort of thing that ships with Mongo that you can just rock into on your command line and sort of see your data and stuff.
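The preservation master / modified master idea for data, as described above, can be sketched roughly like this. It is only an illustration: the field names, the sample record and the exact cleaning rules (strip brackets from places, store 0 for any date that is not a single clear year, flatten subject subfields into one list) are assumptions made for the example, not the actual scripts behind the talk.

```python
import re


def clean_place(raw):
    """Strip cataloguers' square brackets and trailing punctuation from a place name."""
    return re.sub(r"[\[\]]", "", raw).strip(" :;,")


def clean_year(raw):
    """Return a year only when the date string is unambiguous, otherwise 0 ("don't know")."""
    years = re.findall(r"\d{4}", raw)
    uncertain = "?" in raw or "[" in raw or "-" in raw
    if len(years) == 1 and not uncertain:
        return int(years[0])
    return 0


def flatten_subjects(subject_fields):
    """Collapse per-subfield dictionaries into one ordered list without duplicates."""
    subjects = []
    for field in subject_fields:
        for value in field.values():
            if value not in subjects:
                subjects.append(value)
    return subjects


def to_modified_master(pm):
    """Build a simplified working record; the preservation master is left untouched."""
    return {
        "title": pm["title"],
        "place": clean_place(pm["place"]),
        "year": clean_year(pm["date"]),
        "subjects": flatten_subjects(pm["subjects"]),
    }


# Made-up record standing in for fields already pulled out of a MARC file.
preservation_master = {
    "title": "Cat tales",
    "place": "[Wellington] :",
    "date": "[1965?]",
    "subjects": [{"a": "Cats", "x": "Fiction"}, {"a": "Cats", "z": "New Zealand"}],
}

modified_master = to_modified_master(preservation_master)
print(modified_master)
```

Because the function returns a new dictionary and never edits its input, the modified master can be thrown away and regenerated with different rules as often as needed.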
I think that might actually be that old OCLC data, I don't know, I found it on my phone, I just liked it because it was all messed up. But you know, computers, they love it, they're like, yeah, this is sweet, I can do this. Everyone else is like, oh, I'm a sad panda, I don't know what to do with this, this is too hard to read, I don't know what I'm doing. But then, you know, that's the magic of web APIs and stuff, so this is JSON coming through a web API, as you spit it out to this web API, then you've got this kind of nice clean view thing, you don't have that horrible black, well I think the black background's cool, but that's because I've always wanted to be a hacker. But you know, that's probably my own issues. But you know, so your panda here, he's all just like, oh yeah, it's a bit more approachable, yeah, I could do that. Computer still doesn't care, he's just like, yep, sweet data, yep, like it, structured, nice. But then, that came with this, it's like data visualization, that's how you make a happy panda, is you make pretty pictures, pretty graphs. Computer still doesn't care, he's just like, yep, yep, I did this, but whatever, you know, but that's a lot of the way with structuring data and stuff, it's more about human readability than it is necessarily about the computer. I mean, obviously, your system's gonna have particular needs, you know, but at the end of the day, you need to be able to read it yourself, and that's quite important. So this is what I came up with in the end, this is my homepage for my data visualization-y thing, exploring the books. So I'm still going through it, adding in different things. So I put a little sweet ad for this talk, so thank you for coming, I see you all saw the ad and were impressed by it.
You know, I got to do cool things like pulling out random subject facts, I can say that, oh, it's a dumb one because it's about readers, I don't even know what a reader is, but okay, let's go into location fact one, location Wellington, most published author is Standards Association of Australia and there's actually two versions of that title, so that's even twice, that number's twice as high as it was, I didn't group that down very well and their biggest publisher is Standards New Zealand, so it's a really boring one as well and again, readers, God, should have read that before I put it in, yeah, it's terrible, but anyway, so this is the sort of the core data visualization-y thing, so this is my new favorite subject, which is cat fiction. This is the 49th most populated subject in the pubs New Zealand data set and I've added in things so you can flick between places and languages and there's a little timeline and scrubs which I'll probably run out of time, but I can do a demo or you can come see me in the break and I'll show you a demo in the actual library stand, this has been here for the last couple of days. 
So, yep, skipping on, right, so being a data ninja, so you feel pretty damn awesome, like I've got this little sort of dance when I pull something off, you know, it's a little, whoo, whoo, whoo, and you know, that's like, you know, that feeling of smashing it, of like, oh man, I've been working on this for months and I just nailed it, you know, my wife is never happier than when I've finally finished something, because I go from being a ball of stress and oh my God, everything's on fire, to, and it's almost like this crazy quick turnaround where you go from, I have no idea what I'm doing, to, I just smashed that and I've totally worked out how many books there are or whatever, and it's a pretty damn awesome feeling, and if you're into problem solving and stuff, it's kind of like the ultimate in problem solving. I've always thought the term a natural high was the dumbest thing ever, but then once I started like finding stuff or nailing these technology problems, I was like, okay, so that's what a natural high is, right, okay, I get you, I can deal with that.
So, one of the few mistakes that I think a lot of people make who are self-taught data people or data scientists, coder people, whatever, is, you know, you run your scripts and you get some sweet results and you're like, oh man, it's the end of the world, there's like three OCLC numbers for the whole of Pubs NZ, or, oh no, they're all movies or whatever, and then you run your script again and you realize, actually no, my script's totally wrong and I just overwrote that one field 15 times or whatever. So no, slow down, check your data, don't just jump to conclusions. That's an Office Space reference, that's my favorite movie, it's awesome, yeah, TPS reports, blah, blah, blah, George Michael, Michael Bolton, whatever. Anyway, okay, so, oh no, I ruined the joke again, damn, I did this in my practice, right, so the joke was supposed to be, I'm not about to get these things tattooed onto my body, but if I was, these are the numbers I would have, and let me tell you that I made this slide before I regenerated the data set on Saturday, so it's a good thing I didn't get that tattoo, because those numbers are wrong. But this is generally what I'm trying to look for, you know, it's like how many books are there, how many of them are in a digital format. I've often used the term digitized in my little thing, but I'm actually looking for digital, I don't really care if it's been digitized or if it's born digital. I mean, a fully typed out book that's word accurate is probably way better than, like, scanned pages or whatever, depending on what your use is, I don't know. But you know, we've got about 68,000 authors, roughly, and I reckon that'll come down quite a bit once I start mapping it to other things and trying to cluster them up better, because remember, we've got to find those dudes and dudettes that died before 64, which are what we call our safe authors, so just based purely on the Pubs NZ data, I can tell you that there's just under 4,000
safe authors that we could probably go out tomorrow and digitize all their books and be pretty stoked. I wish I had a number of how many books that was, but you know, the hindsight and the lack. So yeah, I mean, and then there's the grand ideas that this thing isn't gonna answer, but you know, hopefully it starts some discussion, and you know, this has been one of the good things about showing people this, is these are some of the things that, you know, I can now start talking to people about, like what is digitized enough? You know, do you need images and text? Do you need both? Do you just need one? I mean, a lot of the Gutenberg stuff, for example, is just pure text. Would we consider that not digitized enough, even if the accuracy is like 100%? I don't know. Someone should tell me, I don't know, we should have this discussion. Do we have quality expectations? You know, if we are getting images, do they all have to be 300 DPI at the scale of the item? Do we expect 100% character accuracy? I mean, you know, OCR is a strange beast, and you know, it'll often tell you it's 95% accurate, but if you actually look at the words, you're like, there's no way that's 95% accurate. So, you know, like, how do you even measure that? If you can't trust the ABBYY output or whatever, what is your metric for measuring that? I mean, I've often talked to Jay about making crazy dictionary lookup scripty things, using basically spell checkers across works to tell you how good it is, but, you know, that's like way beyond me. That's like Douglas Bagnell level stuff. So, and then the other thing is, okay, required level of access. You know, like, let's say a book is published tomorrow. They only ever put it on Amazon, but it's a digital copy, and they're a New Zealander living overseas. They haven't done their own thing. You know, do we need to necessarily have that, or if we know it's, Amazon's a bad example, well, HathiTrust is maybe better.
I don't know, if people can get it, do we need to redo it? What's the level of access that we need? And that's it. So, I've run through it. I've probably run over time. I wasn't even watching. So, if anyone has any questions. I've got one question. I've got a burning question. No. I'm happy with that. You've run over time anyway, but thanks. Yeah, yeah, no worries.