Hi, thank you for joining this CNI Fall Membership Meeting project briefing on machine learning for GIS, the ML4GIS project, which is striving for scalable processing of scanned map images. I'm Michael Shensky, the GIS and Geospatial Data Coordinator for the UT Libraries at the University of Texas at Austin, and I'm presenting today on behalf of my project team, including Danna Gurari and Aaron Cho. This work was supported by the Good Systems research grand challenge at the University of Texas at Austin, and I'm really excited to talk to you today about this project.

First, I want to give a little bit of background about what led to this project, which started in the fall of 2020. The motivation really came from the fact that the UT Libraries manages a growing collection of over 60,000 scanned map images that currently need expanded metadata and need to be georeferenced in order for us to improve discovery and access to these valuable resources. So our goal was to enhance these scanned map images in our collections so that we could add them to our Texas GeoData portal and make them easier to find for our campus community as well as members of the public.

One of the challenges we were facing is that we have this very large number of maps, tens of thousands, that still needed to be processed. We did a little bit of math to figure out how much work manual processing might require: if we estimate that about 15 minutes is needed to manually georeference and develop metadata for one map, then processing all 60,000 maps would take almost eight years for a single person working 40-hour weeks (60,000 maps at 15 minutes each is 15,000 hours, or roughly 7.5 years of full-time work). That's a very long period of time, even in the best of situations where there would be dedicated staff support just for this effort. So we were really interested in seeing whether there was anything we could do to speed up this process, make it easier, and perhaps automate it so that it would require less manual work overall.

Our project goal was to accelerate the processing of the scanned map images by initially leveraging crowdsourcing to produce map annotations that can be processed to generate metadata and georeference the scanned map images. Our long-term goal through this project was to train machine learning algorithms to process new maps without requiring much manual intervention at all. The project was envisioned as a multi-year effort, and what I'll be speaking about today is the first year of the project, what we were able to produce, and our plan for the years ahead.

I do want to introduce our project team. We had two people from the Libraries, myself included, who brought to this team a really extensive collection of tens of thousands of scanned map images, as well as experience with development and with georeferencing scanned map images, which makes them more useful and easier to use in geographic information system (GIS) software. That was a key goal for us: taking the maps in our Libraries' collections and getting them to a point where they could be useful in GIS software, which is what many members of the campus community are using for geospatial research.
We also had involvement from the University of Texas at Austin iSchool, which brought strong experience with developing annotation interfaces and familiarity with crowdsourcing platforms, both of which were going to be integral to the success of this project. I was also fortunate to have project team members from the City of Austin, who have their own unique collection of scanned map images, in some cases very complex map images, that were also in need of georeferencing and metadata. They were able to bring these use cases and unique maps to the table, which was helpful as we tried to develop a generalizable workflow that could work for a wide variety of scanned map images.

Our first step was figuring out how we could facilitate the annotation creation process. We had tried other approaches before, mostly using GIS software itself to georeference images and create metadata for scanned map images. That process can be slow, and it isn't really possible to crowdsource it easily using GIS software. So we set out to develop a new annotation interface that would give us additional flexibility in how we moved forward with crowdsourcing the annotation process for the scanned map images.

We wanted this new annotation interface to be customizable, allowing us to include built-in instructions, which is really important when crowdsourcing the annotation process, so that users who had never done this work before would be able to understand the workflow and see examples of what we were looking for in the annotations. We also wanted support for multiple annotation types, specific to the materials we were processing, in our case scanned map images: we wanted to be able to customize the interface so that users of the tool could pick out specific elements or pieces of information on a scanned map, identify them as particular kinds of things, and allow information about those elements to be extracted. We wanted the tool to be really easy to use, which again was going to be important if we hoped to roll it out to users without previous experience working with scanned map images or annotation. We wanted it to be faster than the annotation process in GIS software or other multi-purpose tools; this annotation interface is designed to be streamlined and focused solely on facilitating the creation of annotations. And we wanted to support a wide variety of maps: among the tens of thousands of maps in our UT Libraries collections, most of which come from our Perry-Castañeda Library Map Collection, there are many different map collections and map series, and each map is so different that we wanted the tool to be adaptable enough to support annotations for a diverse selection of maps.
What we see in the screenshot on the right is an image of the annotation interface itself and the tutorial that we were able to build into it to provide instructions for users involved in the crowdsourcing process, so that they could see clear guidance on what we hope to achieve through use of the tool.

Here we see an overview of our planned workflow. Our goal was to assemble these scanned map images, organize them, and prepare them for annotation so they could be loaded into the annotation tool. The next step would be processing the annotations created by our crowdsourced users, which would be exported from the tool itself in a standardized format; in this case we decided to use JavaScript Object Notation (JSON). A scripted process would then generate metadata from those annotations and also georeference the maps using the information we were able to extract from them. The resulting georeferenced scanned map image, along with the metadata created for it, could then be added to our Texas GeoData portal, which requires that maps be georeferenced and have metadata before they can be added, to really make them more discoverable. While our scanned map images are currently available on our Perry-Castañeda Library Map Collection website, that website doesn't provide search functionality because very limited metadata is available for many of these maps. By creating this metadata and georeferencing the map images, we can make them available through this more robust portal that makes it easier to search and browse for maps. That was a key goal for us.

Here we see two screenshots zoomed in pretty close on the annotation interface, looking specifically at annotations in the process of being created, and I want to talk a little bit about how we are planning to use the annotations to create metadata and georeference maps. Our plan is to use annotation information that specifically identifies sections of the map that would be useful for the creation of metadata. On the left, we see a section of a USGS topographic map with an area highlighted that contains information that would be useful to include as metadata for that scanned map image. A rectangle has been drawn around it, which we see represented in red; that is the area of the annotation on the scanned map image. Above it, in the white box, we see that it has been identified as an area of metadata on the map; that is the label applied to the annotation. Below that, the text present on the map has been typed in manually by the person using the tool, so that information is added to the JSON file that contains all the annotations for the map. This information can then be extracted using a Python script and used to create metadata in the formats we are looking for. We currently use ISO 19139 XML metadata for our scanned map images, as well as GeoBlacklight schema metadata, because we use GeoBlacklight to power our Texas GeoData portal.
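To make that extraction step concrete, here is a minimal sketch of the kind of Python script we have in mind. The JSON structure, field names, and the GeoBlacklight record values shown here are illustrative assumptions of mine, not the tool's actual export schema.

```python
import json

# Load the JSON export produced by the annotation tool.
# NOTE: the structure and field names below are hypothetical, used only to
# illustrate the approach; the tool's real schema may differ.
with open("annotations.json") as f:
    export = json.load(f)

# Collect the annotations that were labeled as map metadata, keyed by the
# kind of information the annotator assigned (title, scale, legend, ...).
fields = {}
for ann in export["maps"]["douglas_quadrangle"]["annotations"]:
    if ann["label"] == "metadata":
        fields[ann["field"]] = ann["text"]

# Translate the extracted text into a minimal GeoBlacklight-style record.
# A production script would also emit ISO 19139 XML alongside this.
record = {
    "dc_title_s": fields.get("title", "Untitled map"),
    "dc_description_s": fields.get("description", ""),
    "dct_provenance_s": "University of Texas Libraries",
}
print(json.dumps(record, indent=2))
```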
On the right, we see an area of the same map that has been identified as containing latitude and longitude coordinates. The specific point on the map that has those coordinates has been marked in red, with a diamond shape rather than a rectangle, and we see that the annotation has been given the label of coordinates, with the coordinate information entered by the user of the annotation interface. In this case, this annotation would also be added to the JSON file that is produced once the annotations for a scanned map image are saved. The coordinates can then be extracted from that JSON file based on the label and used to georeference the map through an automated process. As long as we have a few of these points for each map, where we know both the coordinates on the Earth that a point on the map corresponds to and the pixel coordinates of that point in the scanned image, we can relate the two: we can relate the pixel coordinates in the image to the real-world coordinates that specify the location on Earth, and georeference the map automatically, just from the annotations. So those are our goals for using the annotation information created through this new interface to develop metadata and georeference our scanned map images.

This is our overall timeline for developing the interface and getting it to a point where we would be ready to deploy it in a crowdsourcing environment. In the fall of 2020 we were focused on building out the requirements for the annotation interface. From January to August of this year, 2021, we developed a prototype of the annotation tool. Between August and September of this year we wrote documentation for the tool and performed testing to verify that it met the requirements we had established the year prior. That is really where we are now: we have a functional annotation tool with documentation, and the source code for the tool has been published to GitHub; it's available at the link we see here on this slide. It was developed with React.js and Node.js, and it's designed to be fast, easy to use, and customizable, again meeting the requirements we had set out at the outset of this project. It can be run locally, on the computers of users here in the UT Libraries for instance, or deployed in Amazon Mechanical Turk for crowdsourcing. The fact that we can run it locally and use it in a crowdsourcing environment is really important for our plans moving forward, as I'll mention in a moment.

Very quickly, I want to give a short annotation demo so that we can see a little bit more about how this tool works. I'll demonstrate how annotations are created, how they're saved in JSON format so the tool can be used to process multiple maps, and how the interface and the list of maps it presents can be customized. I'm going to exit the presentation slides here and switch tabs to show the interface itself. When the interface is started, it runs in my browser, and at the top of the annotation interface we have a tutorial that users working with this tool for the first time can reference. This is particularly pertinent once we deploy the annotation interface in a crowdsourced environment like Amazon Mechanical Turk, since those users will need to review these instructions in order to understand how to use the tool. And the tutorial is customizable.
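In fact, the tutorial, the annotation labels, and the list of maps are all driven by configuration. As a purely hypothetical illustration (these field names are placeholders of my own, not the tool's actual schema), a configuration might look roughly like this:

```json
{
  "tutorial": [
    {"step": 1, "instruction": "Find a printed coordinate pair at a corner of the map."},
    {"step": 2, "instruction": "Mark the point, choose the coordinates label, and type in the printed values."}
  ],
  "annotationLabels": ["coordinates", "title", "metadata", "scale", "legend"],
  "maps": [
    {"name": "douglas_quadrangle", "url": "https://example.org/maps/douglas_quadrangle.jpg"}
  ]
}
```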
The real configuration is a JSON file that is included when the application is downloaded, so we can modify it as needed to improve the tutorial over time or customize it for certain types of maps that we might be processing in batches. So there's a lot of flexibility here. Users who have finished the tutorial and are already familiar with the interface can go in and start adding annotations using the tools available here in the left-hand column.

I'll zoom in so we can get a closer view of things we might want to annotate. We're a little limited on time for this prerecorded session, so what I'm going to do is add an annotation that matches the example I mentioned earlier. We can see that the corner of the map has clearly identified coordinates. I'm going to add a coordinate annotation here and type in the coordinates that I see, in latitude and longitude format; I'll enter 31 5 for the latitude and 94 5 for the longitude. So I've recorded the latitude and longitude and identified this annotation as a pair of coordinates, and then I'm going to click OK. That annotation has been created.

I can also use a different type of annotation that identifies a rectangular area of the map. Here, instead of using a point as we saw for the previous annotation, we have a rectangular area being identified. I'm going to identify this as the title of the map, and I will type in the quadrangle name. And now the second annotation has been added. We can go through and annotate many different things that we see in each map: we can annotate the title, we can annotate coordinates, and we could annotate labels present in the map if there are things we want to identify in the map itself.

I'm going to quickly pan down here to the lower region of the map to see some of the other things we might be interested in. Again, we see information that might be useful for the creation of metadata: we have scale information and legend information. All of these things are potentially useful to us and things we might want to have folks annotate in the map. Our goal is that if we are able to develop annotations for hundreds or thousands of maps, we might then have a sufficient dataset to train a machine learning algorithm to automatically identify these same types of features in maps that have not been manually processed. That would save an immense amount of time if we can get to that point, and that's our long-term goal with this project.

Now, here we've just processed a single map image. Let's say those were the two annotations we wanted to add; in a real scenario, we would add quite a few more. When I'm done with this map, I click the save button, and then down here at the bottom of the screen I click the button to download the results. Here's the JSON file that I've mentioned; I'm going to quickly open it so we can see what it looks like. You'll see there's a lot of information in here, and that's because all of the maps that are set up to be processed with the annotation tool are listed, so we see many maps for which we have not yet created annotations. If we look closely up here at the top, we see the annotation information that we've entered for this map: the Douglas quadrangle that we identified as the title of the map is mentioned here.
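This same export is what would feed the automated georeferencing step I described earlier. As a minimal sketch, assuming the open-source rasterio library and the same hypothetical export schema as the earlier example, relating pixel coordinates to Earth coordinates could look something like this:

```python
import json

import rasterio
from rasterio.control import GroundControlPoint
from rasterio.crs import CRS
from rasterio.transform import from_gcps

# Load the annotation export (same hypothetical schema as the earlier sketch).
with open("annotations.json") as f:
    export = json.load(f)

# Each "coordinates" annotation pairs a pixel location in the scanned image
# with the latitude/longitude the annotator transcribed from the map.
gcps = []
for ann in export["maps"]["douglas_quadrangle"]["annotations"]:
    if ann["label"] == "coordinates":
        gcps.append(GroundControlPoint(
            row=ann["pixel_y"], col=ann["pixel_x"],  # position in the image
            x=ann["longitude"], y=ann["latitude"],   # position on the Earth
        ))

# With a few well-spread control points, fit an affine transform that maps
# pixel coordinates to geographic coordinates.
transform = from_gcps(gcps)

# Rewrite the scanned image as a georeferenced GeoTIFF.
with rasterio.open("douglas_quadrangle.tif") as src:
    profile = src.profile
    profile.update(driver="GTiff", transform=transform, crs=CRS.from_epsg(4326))
    with rasterio.open("douglas_quadrangle_georef.tif", "w", **profile) as dst:
        dst.write(src.read())
```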
Coming back to the downloaded file: this is all in a standardized JSON format, and we also see the coordinates annotation that we added up at the top. So we can process this file using a Python script and extract the coordinate information as well as the information useful for the creation of metadata, and do that in a very standardized way because of how these annotations have been set up. This is going to be really useful to us, since we've automated many other similar processes using Python before; we just haven't had great metadata or great coordinate information for our maps to work with. This solves that problem for us. Right now we're hoping we can collect this information through manual processing of the maps via crowdsourcing, and eventually get to the point where we can use machine learning algorithms to do this type of work for us.

Very quickly, I'll show you the other maps that are currently set up to be processed; we see a variety of other map types that can also be handled by the annotation interface. And this annotation interface, as I mentioned, is very customizable. Here we see the list of maps that is currently set to be processed; we can modify the links here, and if there were new maps we wanted to add, we would just add them as additional items in this list. Again, this is JSON formatted, so it's really easy to customize and modify. We also have the customizable tutorial here, so if we wanted to add new steps to the instructions we could do that in this area. We can determine which annotation tools are activated when the annotation interface is displayed, and we can also determine what types of annotation labels are available in the dropdown menu that appears when an annotation is added. So there's really a lot of fine-grained control that we have with this interface, and it's really going to benefit us moving forward.

Alright, I know we're running a little short on time, so I'm going to go back to the presentation slides and discuss a final few things that are important to mention about this project. First, I want to go over our next steps. In the current academic year, 2021 to 2022, we are hoping to process about 500 maps in house by running the app locally: library staff members will go through and very carefully generate all the annotations for about 500 maps that we think are important to record. We then hope to deploy the ML4GIS annotation tool in Amazon Mechanical Turk to gather annotations for the same maps that we've processed locally, so that we can perform a quality assessment of the crowdsourced data using the in-house annotations as a baseline. We want to be able to assess how accurate those crowdsourced annotations are: can we rely on them to train an algorithm to identify the types of features we're interested in in scanned map images? If so, our next step in the following academic year, 2022 to 2023, would be to generate annotations for several thousand maps using a crowdsourcing approach, scaling up annotation creation that way, and then use those annotations, which we're assuming will be of sufficient quality, to train a machine learning algorithm to identify the same types of information in new scanned maps that have not been manually processed.
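To sketch how that quality assessment could work: one common approach, and an assumption on my part rather than a finalized plan, is to count a crowdsourced annotation as correct when its drawn box sufficiently overlaps the corresponding in-house baseline box and its transcribed text matches.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def agrees(crowd, baseline, iou_threshold=0.5):
    """True when a crowdsourced annotation matches the in-house baseline:
    the drawn boxes overlap enough and the typed-in text is the same."""
    return (iou(crowd["box"], baseline["box"]) >= iou_threshold
            and crowd["text"].strip().lower() == baseline["text"].strip().lower())

# Example: a crowdsourced title box shifted slightly from the baseline.
baseline = {"box": (100, 40, 400, 80), "text": "Douglas Quadrangle"}
crowd = {"box": (105, 42, 404, 82), "text": "douglas quadrangle"}
print(agrees(crowd, baseline))  # True: boxes overlap heavily, text matches
```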
Once that algorithm is trained, we would like to use it to generate annotations for the tens of thousands of maps remaining in our collection that would not yet have been processed, and then use the annotations it develops to georeference our map images and generate metadata for them, so they can be added to our Texas GeoData portal and made available to our campus audience through a portal that makes the maps much easier to find and to use in geographic information system software.

I want to make sure to acknowledge all the others who have played a key role in this project. Alyssa Jean and Patrick Chow were absolutely integral to the development of the ML4GIS annotation tool and worked with Danna Gurari from the UT iSchool to develop the annotation interface that is really the core foundation of this project. I want to thank our project partners at the City of Austin and the Austin History Center, Ross Clark, Jennifer Hecker, and Mike Miller, for their involvement, their feedback on the annotation tool, and their contribution of maps from their collections that we have used to test the tool. I also want to again thank the Good Systems research grand challenge at the University of Texas at Austin for the funding that made this project possible.

I want to again share the link to the source code, if anyone is interested in taking a look at this annotation tool themselves, and to provide my contact information in case there are any questions about this project or anyone is interested in learning more about the work we're doing moving forward with this tool. I would love to hear from you, and I would be glad to answer any questions you might have. And with that, I'll conclude this presentation. Thank you again for watching; I really appreciate it and hope you enjoyed learning about our ML4GIS project. Thanks.