Thank you so much, everyone, for inviting me along to speak to you today about some of the work we're doing at the library. I'm coming to you from Dharawal Country, so I'd like to pay my respects to Elders past, present and emerging, and thank them for their custodianship of the land that I get to call home. I've been working from home for about two years now, as everybody else has, and if you do hear any weird sounds in the background, it's just my dog snoring on the floor down there. I have to make an apology for them every time I present these days.

So I wanted to talk to you a little bit about our TIGER project. Just to give you some context, you may have seen that the library released a new catalogue earlier in the year. We actually rebuilt a brand new library catalogue from the ground up. It's been a very significant software development project that we've been working on for several years, and one of the features of the new catalogue is a digital collections portal.
So the DC portal is essentially a subset of the library's full archive where users can more easily browse and discover open access digital material from the collections. We have a very, very large collection: there are over five million items already available in DC right now, and we're still adding more material to the site all the time. It was very challenging previously to find some of the digital collections in the full archive from the library, and one of the reasons for that is that we don't always create what we call file level metadata in our collections. We don't describe individual files and images every time we go through a traditional cataloging process. So in order to make those lower, more granular levels of the collection more discoverable, we needed to find a way to create file level metadata. But because we're dealing with such a large collection, we needed to find a way to do that at scale, as cost effectively and as quickly as possible. That inevitably led us to exploring the use of machine learning and AI services.

So TIGER is effectively the library's automated image tagging project. TIGER stands for Tagging Images Generically for Exploration and Research. The reason why TIGER is a little different to other image tagging projects out there, and other examples that you may have seen, is that we actually tag all of the images from the library's collection with three different commercial off-the-shelf services. We tag everything with Google, Microsoft and Amazon, and the TIGER component is our custom algorithm that sits over the top of that, which is effectively our built-in QA system or process. The idea is that the TIGER component analyzes all of the raw data that we get from the three different services and tries to identify the most relevant and the
most accurate terms from the raw data, and we only deliver the TIGER tags to the user and not the full raw list. The algorithm is a rules-based approach. It looks at things like the confidence level ratings of different terms across the data that we receive from the three services, and we also take into consideration things like repetition and duplication of terms across the services. Because we're using three different services, we can compare and contrast them to each other.

Just to break that down in a little bit more detail: the rules-based system is essentially a series of mathematical equations that looks at all the raw data that we get from those services. We look at things like the confidence level rating, so the higher the confidence level, the more likely we will accept that term, because higher confidence generally, not always but generally, equates to a higher level of accuracy. And again, we look at things like: did two of the three services get the same tag? If so, we're more likely to weight that higher than a term that was only delivered by one of the services.

One thing to note about the rules as well is that this slide makes it feel very succinct and simple and straightforward, but this was months and months of work to get to. Because we have such a large and such a varied collection, it was really challenging to come up with a series of rules that work for the majority of the collection. So yes, it was quite a challenging process, and there was a lot of tweaking and changing and adding and subtracting rules to get to this sweet spot.

Just to show you what it looks like in practice: this is an illustration of a rainbow lorikeet from our Derby collection. It has been tagged, as you can see, with animal, bird and parrot. These are the three tags that have gone through the TIGER process and that we deliver to the user. So it would mean, for example, that if someone was doing research
around parrots, this particular image would come up in their search results. But if we look at the raw tags that were received from the three different services, they're all of the tags that are marked here in green. All of the green tags here have actually been removed as part of the TIGER process because they didn't meet the various rules of our custom algorithm. So the idea with TIGER is that it's trying to remove as much of that white noise as possible when it comes to all of the different tags that we receive from the three services.

Another example: this is an image of Circular Quay, and there's obviously quite a lot going on in this image, but one of the key themes is that there are cable cars or trams in the image. You can see that the TIGER tags that we've delivered are cable car, transportation and vehicle. But if we look at the longer list of tags that were eliminated, you can see that there are quite a lot of other terms here, many of which have no relevance to this image, and we're getting all the way down to confidence levels of less than five percent for those terms. So we're really getting into the weeds, and TIGER was able to eliminate those terms.

As well as removing as much of that white noise as possible, one of the really key parts of this project is that it was about doing this QA process with as little human intervention as possible, or basically none at all. The idea was that we needed to build a system where we had enough confidence that we didn't require staff to review the data that we were making available to the user. That's why we've been quite strict with the rules that we've applied, so that we are removing as much of that white noise as we possibly can.

As well as creating more data at the file level, more of that file level metadata, which is what we had set out to do, one of the really interesting things that TIGER has been able to deliver to us is other
ways or lenses of viewing the collection that weren't really possible before. I'm just going to show you a few examples. This is one that I really enjoy: this is the tag of triangle, and you can see here that each of the images has been tagged with triangle because it has very distinct triangular shapes in the frame. Even if these were all images that we had created our own file level metadata for through a traditional cataloging process, triangle is not the kind of tag that we would have applied to these images, because it's not relevant to the context or the content of the images or the collections themselves. So this has created a really interesting and quite unique way of exploring the collection that we would never have been able to make possible before.

Another example of that is this tag of flightless bird. Each of the individual images here in this set would have been tagged with things like penguin and emu and cassowary, but never before could we have had a common theme or a way to start browsing those collections together. They have been brought together with the term flightless bird, and it means that we can start bringing those disparate collections together for the first time, which is really interesting.

This is another example I really like. This is just a very simple tag of the word sign. You can see that we're seeing things like road signs and street signs and protest signs, but what is really interesting here is that we're also seeing archival material alongside contemporary material. It's not impossible to search the collection with this kind of lens, but we've made it a lot easier to do that with the work that we've done with TIGER, which has been really interesting.

So at this point in the project, which was probably around when we launched, I think the beta version
was back last year, maybe in June last year, so it's been over a year. When we got to that original point, we could basically say yes, we checked the box of the original aim that we set out to achieve, which was to create more file level metadata and to make those more granular parts of the collection more easily discoverable. But one of the things we didn't do with this project, and maybe one of the things we potentially made a little bit harder for users, was make it easier for users to understand what the library actually had in the collection. That's feedback that we get all the time: because the collection is so huge, people just don't really know how to penetrate it, and they don't understand what is in the library's collection. Because we have such a varied collection as well, it's quite challenging to be really familiar with what we hold. And so this TIGER process hasn't really resolved that in any way.

If we look at, for example, this search here: I've done a search for violin, and I'm seeing a series of results of images of people playing violin. That's exactly what I would expect to see for what I've searched for. But what I didn't know to search for was anything that might have been related to this term, and I would have had no understanding of what else the library might hold that would be related to my search for violin. If you have a look on the right hand side there, you can see that there's a list of terms, and these are all the tags that had been identified as related to my search for violin. There are quite a few terms here that are relevant to what I've been looking for: musical instruments, stringed instrument, fiddle, violinist, violin family. So there are lots of other terms that might be related, but I didn't know to look for them, or I didn't know that they existed. So what I've had to
do is start at that really granular file level and then, as the user, if I was interested, broaden my search and go out further to see some of the other related terms within the collection. What I really wanted to explore was whether we could invert that process and allow users to start broad, at the really high level, and then drill down to get more specific with their searching as they got a better understanding of what we actually held in the collection.

We basically couldn't do that before with the original data that we had created through TIGER, because what we had essentially done was create a big bucket of thousands and thousands of tags from all the millions of images that we had tagged. It was just a big bucket: there was no hierarchy, no structure, no relational information. It was just another impenetrable dataset that we had created. So in order to bring some structure and group like terms together, we decided to explore creating a hierarchy with this data.

To do that, we decided to work with a pre-existing dataset: the Library of Congress Thesaurus for Graphic Materials, which is a very common cataloging standard that's used all around the world, and we already use it at the library anyway. So we thought, okay, can we use this pre-existing hierarchical thesaurus as our foundation and try to replicate it as much as possible with the tags that we had created through TIGER?

What we did first was a one-to-one matching process. Using the LOC thesaurus as the foundation, we tried to see how many of these terms we could replicate. Just with this particular snippet under musical instruments here, we were actually able to get matches for about two-thirds of the terms, which is actually a
pretty good result. That's probably more than I was expecting. There were some terms here that didn't have any matches. For example, ukulele: we didn't have any images with the tag of ukulele. But even if we remove those terms that had no matches, we've still been able to create a fairly good hierarchical structure with the terms. Suddenly we've got some relational information here and some grouping that would allow a user to, for example, start looking at musical instruments, see everything that we had tagged with musical instruments, but also see the children and grandchildren, the hierarchical structure of the terms, and see what else we have related to that area of interest for them. So we're really pleased with this result, and I was quite happy with the number of matches that we got, because I was expecting it to be a lot fewer.

But there were still lots of terms that didn't get matched, because they didn't have that identical one-to-one match. So the secondary process that we went through was using natural language processing techniques such as lemmatization and stemming to try and bring in those secondary terms, matching things like synonyms, inflections and pluralized words, to bring more of those orphaned terms into the structure. By doing a bit of experimenting in this space, we were actually able to bring in quite a lot of additional terms. Everything that's marked here in orange is a new term that we were able to bring in through that process. So we've effectively been able to double the hierarchy structure and make it easier for users to browse this data at the top level and then drill down.

For the last six months or so, we've been working quite hard from a data point of view, building out this data in a structure in the new
library's collections API, which is the API that actually powers the new catalog. So we've got all the data ready to go, and what we currently have in development is an interface where users will actually be able to access and browse this data. The next few screens I'm showing you are just the design mock-ups, but these are what are currently in development and what we'll have available, hopefully pretty soon.

What we'll be doing is building what we're calling a tag index. It will essentially be an alphabetized index where users can browse all of the different TIGER terms alphabetically, and they'll be able to get a better understanding of the structure that we've been able to create with these new hierarchies. It'll also have a bit of a predictive search, so you'll be able to type in a term and see all the different terms that we have related to the letters that you've typed into the search bar there.

Then, once you select an individual tag page, in this instance for locomotive, we'll be able to show you where locomotive sits in that hierarchical structure: what are its siblings, and what are its parents, its grandparents, its great-grandparents, et cetera. So you'll be able to see it in the context of the hierarchy that we have created. You'll also be able to see some samples of the images that have been tagged with locomotive, and then of course you can click through and get a search results page to see everything that's been tagged with that keyword as well.

So it's basically creating just another way of accessing the same data. We haven't added any additional data, and we haven't created any new tags with this process, but what we've been able to do is create a different level of structure, which provides a different way into the material and into the data, and
hopefully allowing users to get that better and more comprehensive understanding of what we hold in the collection.

So this is where we're at at the moment. We've still got quite a lot of work to do; it's a very iterative project. We'll be looking at things like training our own custom algorithms rather than using those commercial services. We're really keen on building in a process for things like user-generated feedback and user-generated tags. Because we don't have any human intervention in the tagging process, a mechanism for getting feedback from our researchers and users would be really beneficial for us. And then of course, because this technology is ever changing, we've also got the opportunity to explore different, improved technologies and services, and we may even go through the process of re-tagging our images further down the track to see if we can get more accurate results as well.

We're really hoping we'll have that tag index soon. We were trying to get it done by the end of the year, but that's getting pretty close, so it might more likely be January. Watch this space, because the index is going to go live, hopefully in the next couple of months, and we're really keen to get feedback and see if it actually does improve users' opportunity and ability to understand what the library has in the collection and get more specific with their searching as well. So that's it from me. Thank you so much, and I'll just stop
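
For readers of this transcript, here is a minimal sketch of how a rules-based filter over three tagging services, like the one described in the talk, might look. The talk only says that the rules weigh confidence levels and agreement between services; the function name `consensus_tags`, the thresholds (`solo_min`, `shared_min`) and the sample scores below are assumptions made up for illustration, not the library's actual rules or data.

```python
from collections import defaultdict


def consensus_tags(raw, solo_min=0.95, shared_min=0.80, min_services=2):
    """Keep tags that several services agree on, or that one service is very sure of.

    raw maps a service name to a list of (label, confidence) pairs, with
    confidence in the range 0.0 to 1.0. All thresholds are hypothetical.
    """
    # Group scores by normalised label so "Bird" and "bird" count as one tag.
    by_tag = defaultdict(dict)
    for service, tags in raw.items():
        for label, confidence in tags:
            key = label.strip().lower()
            previous = by_tag[key].get(service, 0.0)
            by_tag[key][service] = max(previous, confidence)

    kept = []
    for tag, scores in by_tag.items():
        best = max(scores.values())
        if len(scores) >= min_services:
            # Agreement between services earns a lower confidence bar.
            if best >= shared_min:
                kept.append(tag)
        elif best >= solo_min:
            # A tag from a single service must be very confident to survive.
            kept.append(tag)
    return sorted(kept)


# Invented sample data, loosely modelled on the rainbow lorikeet example.
raw_tags = {
    "google": [("Bird", 0.98), ("Parrot", 0.91), ("Feather", 0.62)],
    "microsoft": [("bird", 0.97), ("animal", 0.99), ("branch", 0.40)],
    "amazon": [("Parrot", 0.88), ("Animal", 0.96), ("Art", 0.04)],
}

print(consensus_tags(raw_tags))
```

In this sketch a tag seen by only one service is dropped unless its confidence is very high, which mirrors the idea of being strict so that no staff review is needed. A real rule set tuned over months on a varied collection would likely be more involved.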