[Opening remarks not accurately captured in the transcription.]

I'm going to do that, and then any minute now we're going to have some pictures and it's going to be cool and awesome. So the collection started life, I guess, maybe three or four years ago in the library. We are super grateful to the Amplifier website and the underlying company, DRM, for helping us with this project. The idea was: there's a whole bunch of published music in New Zealand being distributed through the Amplifier website, and we want to ingest it as part of the legal deposit collection. We don't really want to go and harvest the website — I mean, we could — but that's going to take forever, and we haven't got time for that. So in the end they said, well look, why don't we just give you a hard drive with it all on? So we got a hard drive, and it was 2.6 terabytes.

On the side closest to me, that's a screen dump of a tool I use quite a lot to analyse data stores, called WinDirStat. Each of those blocks is a cluster of similar files — same colour, same file type — and the block size reflects the number of bytes. You can see some regular structure at the top, and then it just descends into what looks like, frankly, madness. The closest equivalent I could find was these semi-well-organised piles of stuff — and that's not to besmirch what they were doing. It was dumps out of two different CMSs. There was some human stuff in there: 160,000 files, all out of a Mac OS, so there were a whole bunch of resource forks and files that did nothing for us, really. There were promo pictures and audio and a few bits of video, relatively well organised in a pseudo-sane label / artist / work (album, EP, single, whatever) / track structure. But we needed to do a bit of work to turn it into something else, and there were zip files in there as well. So it was just kind of, here's two and a half terabytes, go and make it into library stuff.

And that's what we wanted it to look like. This side is the same tool showing the processed SIP — the submission information package, what we ingest — and each of these blocks broadly represents one of those items as we ingested it into the collection. So this is basically the story of how we went from that organised, working, active BAU material into something we can deal with in library land.

My part in this job has been to, I don't know, figure it out. I could talk at length about even just moving the 2.6 terabytes onto corporate storage sanely, and counting everything, and making sure everything was there — the third time around, when we did it. Every transfer takes about four days. Life's hard sometimes in the public sector. So we had the hard drive. Phase one: there was this awesome spreadsheet in there which had a whole bunch of tracks listed individually, so we knew a whole bunch of context about some of the content. That was phase one. And then we had a whole bunch of separate spreadsheets — as you can see, like one and a half thousand of them.
And then we had a whole bunch of stuff that was just like, I don't know, here's a bunch of stuff, get on with it. So for each of these phases we had to identify every item: this song that exists conceptually on this spreadsheet — is there a corresponding file for it? That album that's described by its tracks — are all those tracks actually there? Have we got them all, and is there an image, or maybe no image? Basically we were trying to sift it all through a net and turn it into something sane. We wanted to identify the right ones, because it would be kind of bad if we got the wrong ones. We needed to make records — Rich is going to talk to you about that process, because frankly, for me, that's the magical bit. We needed to look for dupes — dupes were the biggest thing of this project, and I'll talk about that a bit more in a second. And then, in the end, turn the handle, package them up real neat and send them off to the mechanical stuff.

So for me it was a project of tool building. We know what we want to do. We roughly know how we want to get there. But we often just don't have the tools to do it in this sector. We don't have a lot of money; we don't have a lot of dev support. Those of you who might have seen me talk occasionally will have heard me describe learning Python because I realised no one was coming to save me, and this growing pile of things I needed to do was not getting any smaller. Python is the one-trick pony that I use all the time to solve all my problems in the world, ever — sometimes not very successfully. So we just built a heap of different tools, and I'm going to talk about a few of them now.

This one was a fixity tool. These are screen grabs, but this thing just floats on my desktop and it really helped me with file fixity: are two files bitwise identical? I could generate logs and do comparisons that way, but I'm unpacking zips and checking whether this version of that file is the same version — do we need to keep them both? So I made this little widget where I can drop a file onto it, drop another file onto it, and it'll go: yeah, they're the same, don't worry about it; or: oh no, they're different, you need to keep both of them. That didn't exist; I made it; it does now. You can use it if you want — it's on GitHub and stuff. It's not pretty, but it does the job. I used it quite a lot.

The de-duping we've talked about — Rich is going to talk about fuzzy matching and how we used it heaps to help us do a bunch of de-duping. De-duping was my favourite thing that we did, and the biggest part of this whole project. We packaged a whole bunch of stuff in phase one; we get to phase two and we need to make sure we don't already have an existing item, and that we haven't packaged the same thing twice in this phase and in a previous phase. Rich and I must have spent days — weeks — de-duping manually. It was interesting listening to Chris's keynote yesterday talking about the human effort: we built tools and automated a whole bunch of stuff, but we also spent a lot of time humanly dragging and moving stuff around. One of the ways of de-duping images that we came up with was to take all the images, scale them down to 64x64 pixels, and then do a pairwise comparison of their RMSE — the root mean square error between two images. If we get a zero, they're probably the same image.
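As a rough idea of what that image comparison can look like in Python — a minimal sketch, not the project's actual tool; the function names, the Pillow/NumPy dependencies and the zero threshold are assumptions:

```python
# Sketch of the 64x64 / RMSE image de-duplication described above.
from itertools import combinations

import numpy as np
from PIL import Image


def thumbnail_array(path, size=(64, 64)):
    """Load an image, normalise it to 64x64 greyscale, return a float array."""
    with Image.open(path) as img:
        small = img.convert("L").resize(size)
    return np.asarray(small, dtype=np.float64)


def rmse(path_a, path_b):
    """Root mean square error between two downscaled images; 0 suggests the same picture."""
    a, b = thumbnail_array(path_a), thumbnail_array(path_b)
    return float(np.sqrt(np.mean((a - b) ** 2)))


def probable_dupes(paths, threshold=0.0):
    """Pairwise comparison across a set of candidate cover images."""
    return [(a, b) for a, b in combinations(paths, 2) if rmse(a, b) <= threshold]
```

Downscaling first keeps the pairwise comparison cheap; an RMSE of exactly zero is a very strong duplicate signal, and a small non-zero threshold would tolerate near-identical re-encodes.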
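The fixity widget described earlier in this section boils down to the same check you can do with a couple of lines of hashing — again just a sketch under assumed names, not the GitHub tool itself:

```python
# Compare two files for bitwise identity by hashing them in chunks.
import hashlib
import sys


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte audio never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    a, b = sys.argv[1], sys.argv[2]
    same = sha256_of(a) == sha256_of(b)
    print("bitwise identical - safe to de-dupe" if same else "different - keep both")
```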
And so, having humanly de-duped thousands of things and found lots of logical dupes and lots of fixity dupes and lots of fuzzy dupes, we still, at the end of the project, found dupes using weird, not-sensible methods. It kind of told us that there's something in this de-duping problem. Anyway — we were making up tools as we went along.

This was probably my favourite part. Because of the transfer problem, it took me forever to move everything. So we ended up — and I borrowed this, I say borrowed, I very plainly stole this — from the broadcasting world, where you make a low-res proxy copy that you can edit and move around elegantly because it's small and compact. I did that for the whole file system, which means I can do the processing at the file-name level, capture those instructions, and then replay them on the live collection. It meant I could rehearse manoeuvres and see what happens when I do deletes and renames and moves, and I could hand the collection off to other colleagues. We had a contractor doing some de-duping for us, and it meant I could give him the entire collection on a memory stick without any fear of compromising the intellectual property, because it doesn't exist: all the files in the shadow collection are empty. There's nothing in them; it's just a file with a file name. But he could do the de-duping against the file names and the structure, give me those instructions back, and I'd replay them on the main collection — and bingo, we've got this real nice way of dealing with a kind of instruction compression. Again, it's on GitHub; you're more than welcome to have a go. It's naff, but it saved us weeks and weeks and weeks of effort.

And then, for me, the other tool that we had was philosophical — and this is my last slide before I hand over to Richard. I was trying to find another way of saying "perfect is the enemy of good", and I ended up on this guy, Watson-Watt. He was an engineer in the Second World War and he was part of delivering radar, so we could see aircraft, you know, 2,000, 2,500 kilometres away. I love it because it works with what we wanted to do. We set the project up, we looked at what we needed to achieve, and we talked to our managers and grown-ups, saying, look, this is what we think we need to do. About halfway through the project we realised we just had to be a little bit more flexible. We had to be comfortable with there being errors, as long as the errors happened in a controlled way; trying to make everything perfect every time, we felt like we were just going to be there forever. There's a bunch of other internal reasons why we did that. But we were slowly, slowly trudging along, and then we just said: you know what, we have to be okay with some error. We have to be okay with there being some noise in what's happening, because of the sheer volume that we're doing. As long as we make our records in a way that clearly flags that the machine did this work, we know we can go back and clean up if we want to; there's a reason — you're kind of justifying yourself in the record — and we could suddenly start moving with a bit of pace and speed. So for me, it was about allowing ourselves to be as wrong as humans doing human processing are allowed to be. We were allowed to be wrong. It's okay.
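The shadow-collection trick described above comes down to mirroring the folder tree with zero-byte files, logging the operations done on the mirror, and replaying them on the real data. A minimal sketch — not the GitHub tool itself; the instruction format and paths here are made up:

```python
# Build a zero-byte "shadow" of a collection, then replay recorded
# move/delete instructions against the live copy.
import os
import shutil
from pathlib import Path


def build_shadow(source_root, shadow_root):
    """Recreate the folder structure and file names of source_root under
    shadow_root, with every file as an empty zero-byte placeholder."""
    source_root, shadow_root = Path(source_root), Path(shadow_root)
    for dirpath, _dirnames, filenames in os.walk(source_root):
        target_dir = shadow_root / Path(dirpath).relative_to(source_root)
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            (target_dir / name).touch()  # file name only, no content


def replay(instructions, live_root):
    """Replay 'MOVE src -> dst' / 'DELETE path' lines, recorded against the
    shadow copy, onto the live collection."""
    live_root = Path(live_root)
    for line in instructions:
        if line.startswith("MOVE "):
            src, dst = line[5:].split(" -> ")
            (live_root / dst).parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(live_root / src), str(live_root / dst))
        elif line.startswith("DELETE "):
            (live_root / line[7:]).unlink()
```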
It doesn't have to be perfect every time. And that, really, for me, was my third favourite tool that we used in this project. And now I'm going to hand you over to Richard. Thank you.

Kia ora, everyone. I'm Richard. I'm a collection description librarian — also known as a cataloguer — at the National Library. This role obviously involves cataloguing individual publications, but more and more we're working on projects like this one to generate records in bulk. The records we create are encoded in MARC format and we follow the RDA cataloguing guidelines. I also catalogue digital music as part of my regular work — individually cataloguing the digital releases that we collect — which is always interesting. I basically get to hear New Zealand's music coming into our collection, all the little niches and sub-genres out there. For example, dungeon synth — has anyone heard of that? Wow. Sludge metal — that's a great one too. This work never stops fascinating me. There's only a handful of cataloguers in our team who actually do this digital music cataloguing, and currently we describe between 1,000 and 1,500 digital releases a year. So given current resourcing, and not even considering regular workflow, it would take at least five years to catalogue this DRM project. Obviously that's not going to be feasible, as we've talked about.

So I'm going to talk about two main pieces of work today: matching to existing records so that we can use those, and building new records. We began each phase by looking for existing records to use. Prior to this project we had about 9,500 digital music releases in our collection, so it was highly likely that we already had quite a few from this set. To find these existing records we wanted to use an automated approach, but we needed more than that: we needed methods to handle the variation between the two sets of metadata we were working with — the spreadsheets from DRM and the records we've catalogued over time. For example, my name could be recorded in different ways, so exact matching wouldn't work very well here. To solve this, I wrote a script to compare the metadata in the DRM master spreadsheet — the first-phase spreadsheet — with all the digital music records in our collection. It pulled out the title, artist and tracklist strings for each release, then used fuzzy matching to compare how similar those strings were to those in the bib records. The process returned a fuzzy score between 0 and 100, with 100 being an identical match between the strings. If the fuzzy score was over a set threshold for all the strings — title, artist and tracklist — then the release was identified as a potential match. This is just an example of what that looks like: you can see the fuzzy score there with the title — two slightly different titles and it's got a score of 92 — and with the artist and track names. As the difference between the strings increases, the score decreases. So you can see it was important to set a good threshold so that we got good matches; basically through trial and error we worked out where you'd get a reasonably good match, at a score of about 17 and above. So applying fuzzy matching enabled us to catch a much wider set of possible existing records to use.
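Richard doesn't name the library he used, but this kind of title/artist/tracklist scoring looks roughly like the following with rapidfuzz — a sketch only; the field names, the scorer and the cut-off value are placeholders, not the project's tuned threshold:

```python
# Fuzzy-match a spreadsheet release against a bib record on three strings.
from rapidfuzz import fuzz

THRESHOLD = 70  # placeholder cut-off; the team tuned theirs by trial and error


def match_scores(release, bib):
    """Fuzzy scores (0-100, 100 = identical) for the three compared strings."""
    return {
        "title": fuzz.token_sort_ratio(release["title"], bib["title"]),
        "artist": fuzz.token_sort_ratio(release["artist"], bib["artist"]),
        "tracklist": fuzz.token_sort_ratio(release["tracklist"], bib["tracklist"]),
    }


def is_potential_match(release, bib, threshold=THRESHOLD):
    """Flag for cataloguer review only if every string clears the threshold."""
    return all(score >= threshold for score in match_scores(release, bib).values())
```

With a token-based scorer like this, two titles differing only in punctuation or word order tend to stay in the 90s, which is what makes the approach forgiving of the metadata variation described above.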
Then it was on cataloguing staff — me and my colleagues — to make a decision: is that actually a good match? We have all these rules around whether it's the same publisher, the same year, and all that sort of stuff, but having this kind of data was really useful for making that decision quickly. The result was that we found over 500 records, and this is just an example of one in the catalogue. As Jay said, we used fuzzy matching throughout the project, and it became an integral tool for de-duping the different phases and for locating physical records to copy metadata from into the digital records.

Where we couldn't match to existing records, we needed to create new ones, and this involved using multiple sources of metadata to populate custom record templates. I'll give a brief overview of that process using the example of the master spreadsheet, which had all the good information from DRM. The first step was looking at our standard practice for cataloguing digital music and making decisions about the available metadata and where it could be used. This here is a custom record template — you can see just filler text for title, artist and so on — and we created multiple templates to handle the variation in the available metadata for each phase. For the next step, Jay wrote a script to generate records by reading the master spreadsheet and populating the record template for each release. You can see the spreadsheet metadata there, just as a snapshot. The script used pymarc to create the MARC data, and there was some fine-tuning to work out issues as we created the records, with a bit of back and forth between us — MARC can get very pedantic, I think that's probably a good word. Once that was sorted, the record was created in our library management system via the API. At this point the record has a record ID, which can be used to link the ingested item to our library management system — linking the digital archive to the library management system.

This is an example of one of the fullest types of records that we generated, in the catalogue view — the front-end catalogue, I should say. We adopted an approach where records don't have to be finished the first time we create them; we can do extra passes to update them. For example, automated copying of metadata from matching physical records we'd created — we would describe CDs and vinyl records for matching titles, then automate the copying of that metadata into the digital records. So you can see a contributor — you might not be able to see it at the back — with some names there and clickable links that bring up other items in our collection for those artists. At the other end of the scale are the really brief records we created, where we had very little metadata at all. As you can see, there's not really much there, but they're keyword-searchable and easily identifiable in our systems if we do want to revisit them at some point. So yeah, again hammering that point about being okay with imperfection — that short record is a good example of it. As a cataloguer it's quite hard: you want to see these records full and consistent with the rest of the records in the collection, but obviously that's not feasible if it's going to take five, seven years or whatever to do that.
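To make the record-generation step Richard describes above concrete, here is a minimal sketch of reading a spreadsheet and writing MARC with pymarc. The column names, the chosen tags and the output file are illustrative only, the flat subfield lists follow the older pymarc 4.x style (newer versions use Subfield objects), and the real templates carry far more fields:

```python
# Populate minimal MARC records from spreadsheet rows with pymarc.
import csv

from pymarc import Field, Record


def record_from_row(row):
    """Build a skeletal MARC record from one spreadsheet row."""
    rec = Record()
    rec.add_field(Field(tag="100", indicators=["1", " "],
                        subfields=["a", row["artist"]]))
    rec.add_field(Field(tag="245", indicators=["1", "0"],
                        subfields=["a", row["title"]]))
    rec.add_field(Field(tag="264", indicators=[" ", "1"],
                        subfields=["c", row["year"]]))
    return rec


with open("master_spreadsheet.csv", newline="", encoding="utf-8") as src, \
        open("releases.mrc", "wb") as out:
    for row in csv.DictReader(src):
        out.write(record_from_row(row).as_marc())
```

In the project as described, the generated records were pushed into the library management system via its API rather than written out to a file like this.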
So yeah, we did make decisions on what was good enough to progress the work forward, and it's something we're growing more comfortable with on these bulk projects. I think the benefit of automating stuff has been shown in this talk as well. The good thing is that many of these tools we can continue to apply in future projects, or adapt in different ways. And finally, it's been a good project for me to work on personally. I've had the time and space to try things out to solve problems and to continue developing my own coding skills, which has already had some good benefits in terms of being able to spot problems and create scripts that identify issues with our records across the whole of the National Library catalogue. So that's been fun. I'm just really thankful for Jay's work on this project, and for our colleagues and managers supporting that kind of approach as well. Thanks, everyone, for listening.

Thank you very much, Jay and Richard. We've got quite a bit of time left for any questions you may have, if you're willing to answer.

I've got two questions. Actually, I've got three questions. Does Amplifier not exist any more? I came in late. Does it exist?

It doesn't exist.

Yeah, I thought it existed. So, two questions. You said you used Discogs — I just wonder whether you are writing Discogs IDs or MusicBrainz IDs into your MARC records.

That's a straight no. No, we weren't.

So really the question is why aren't you, but we can keep that for another day. And the other question is about internal metadata. A lot of these will have ID3 tags. Were you extracting those? Are you writing those? Or is that a question for another day?

No, I can talk to that. Some of them did; most of them didn't. Actually, a lot of it was WAV and it didn't have any metadata at all. And because the MP3s were born out of their CMS, they hadn't written ID3 into a lot of the MP3s either. So we looked at it and it just wasn't generating good enough, consistent results to make it worth leveraging — only a few files had it in there. We're also not writing that, because it's not the process that we use; we took the best-quality version. Just to give you some insight into how complex some of this stuff was: for one simple album we might have had maybe 12 WAVs and a JPEG image — happy days, that's super easy. For another we might have had three versions, and inside each of those versions there were three versions of the audio: a Mastered for iTunes version, the CD audio and the website MP3. So it was a little bit of bouncing around to figure those out. Given the complexity of all of that, we were frankly lucky to find the right item sanely, and that was kind of a different process. Once we hand it off, it goes into the NDHA system, we generate MP3s for everything as a work process, and there isn't a plug-in that writes stuff in. We're not cataloguing at the track level; we're cataloguing at the works level — so a single, EP, album, what have you. At the project level we often knew what the track was called, but by the time it becomes an object that's kind of gone, because we're not working atomically at the track level. It would be nice to do; we probably should do it.
We don't, and some of the reasons are the added labour around that, and also that the version that gets consumed predominantly lives well away from the record anyway, if that makes sense. Does that answer your question? Cool.

I like the way you said you probably should do it, and I think I agree with that. And I think you should probably do the MusicBrainz thing as well.

That's another problem for the cataloguers, not me, so I don't know. Thank you. Anyone else? Any questions?

There was a question about where this stuff is. It's on GitHub — I forget which one of my accounts it's under. I don't know how to socialise it. I'll probably tweet it with the NDF hashtag or something. So if it's interesting, just look out for a tweet from... oh, I don't even know what my Twitter account is. Something — I think it's NDH... oh, I'm not using that one at the moment, because of reasons. I'll find a way. We'll make sure it's somewhere on Twitter.

Okay, thanks very much. Thanks again, Jay and Richard, and it's now afternoon tea time. Get some refreshments in Oceania.