I'm very excited to present our next speaker. Mauricio is an incredible developer currently working at NYPL Labs, the New York Public Library's Labs team, where he does some incredible design and development work. I think we'll all really enjoy this, so give him a warm welcome.

Hola, my name is Mauricio. I work at the New York Public Library, a research and circulating library system that spans three boroughs of New York City (the Bronx, Staten Island, and Manhattan), with 90 branches and four research centers. But I work in one particularly spectacular building; it's pretty cool going in every day. It's the Ghostbusters building, if you've seen the movie; you should check it out whenever you're in New York, and say hi. My team is called NYPL Labs. We are a small group trying to find new ways of engaging the public with the collections the library has. This is what we look like, an average photo of us: pretty white, a lack of hair but a lot of facial hair, so it would be cool to change that. I'm going to talk about maps, more specifically about the great map data extraction adventure, in three acts.

But first, the prologue. The Lionel Pincus and Princess Firyal Map Division of the New York Public Library is headed by this man, Matt Knutzen, the geospatial librarian of the New York Public Library. If that's not the most awesome job title, I don't know what is. This is the map room; his office is behind those doors, along with most of our maps, and this is what maps look like: a bunch of books, and some loose sheets too. We have the garden-variety old map with incomplete continents. We have proposed maps of expansions of Manhattan into the Hudson River. We even have a map engraved on an ostrich egg around 1500, apparently the earliest known depiction of what Europeans called the New World.
In all: more than 500,000 maps and more than 20,000 books and atlases, the digitized versions of which have recently been released under a CC0 license, so you can download high-resolution versions of these maps and make your punk band flyer; you know, important scholarly work. But I'm going to talk about a specific set of maps: fire insurance atlases. Back in the mid 1800s, insurance companies would send surveyors throughout cities across the US and Europe to draw each one of the buildings, color code them, and add extra metadata, which, in the case of an accident or if something burned down, would help estimate the value of the property for insurance claims. It looks like a very simple map, just a few rectangles, letters, and colors, but in it there's lots of data. We know the year of the map; the street names; the use type of each building, commercial or residential; the name, if it's some cathedral or slaughterhouse; the material of the building, which is color coded (wood, brick, brownstone); the class, whether it's high-quality or low-quality wood; the address of the building; whether it has skylights in the roof, which is important for the fire department to know; backyards and back lots; of course the geographic location of the building on Earth; and the footprint, the shape of the building itself. As for how they were made: say today we draw one map; if something changed, they would not draw the whole map again, they would just cut up and overlay the changes and publish a new atlas. This went on for several decades until the mid 20th century, so there's a lot of data trapped in a legacy format. We would like to get all this data out. But why would you want this data?
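All the fields listed above could be collected into one record per building; here is a hypothetical sketch in Python (the field names and types are illustrative, not NYPL's actual schema):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AtlasBuilding:
    """One building as described on a fire insurance atlas sheet.

    Field names are illustrative; the real NYPL data model differs."""
    year: int                             # year of the atlas edition
    street: str                           # street name from the sheet
    address: str                          # house number, if printed
    use_type: str                         # e.g. "commercial" or "residential"
    material: str                         # color coded: "wood", "brick", "brownstone"
    has_skylight: bool                    # marked for the fire department
    footprint: List[Tuple[float, float]]  # polygon vertices (lon, lat)

b = AtlasBuilding(1853, "Broadway", "123", "commercial", "brick", True,
                  [(-74.0, 40.7), (-74.0, 40.71), (-73.99, 40.71)])
```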
To get an idea, this is what some projects at other institutions have done. This is every building in the Netherlands, color coded by year of construction; it's an amazing visualization, you should check it out. This is building patterns in Paris starting from the 1800s, every building appearing at the moment of its construction; another pretty cool use of historical data. But you don't have to go back to the 1800s to find historical data. Google started taking pictures of your environment in 2007, and they just recently released a feature in Google Maps (I haven't found it myself, I just saw the news) where you can see the different snapshots Google's cars have taken of a specific part of the city; this is an animation of the construction of the World Trade Center.

So, back to our insurance atlases. You can see these are very simple geometric shapes, right? Our first effort, which I personally had nothing to do with (it's a three- or four-year-old project), was a web tool that lets any user take a sheet of an atlas and make it geo-aware: place it on top of OpenStreetMap, then trace all the buildings by hand and add the street names and so on, basically extracting the data by hand out of each and every single sheet. It is a very time-consuming process; you have to do a lot of work to extract a single polygon. Given the amount of maps we have, doing this by hand would take a while, for I don't know how many hundreds of thousands of pages with buildings in them. So back to our map: one step is digitizing it. Before digitization, if you didn't happen to live in New York City, you basically had no access to the map.
You needed to go physically to our room, ask for a map, open it, look at the page you want, and find your building there. Digitizing is one small step towards bringing these maps out. This is what a map looks like digitized: you put it on a scanner, take a photo, and composite the left and right pages of the book, and there you have it. Up close, though, it's a very plastic thing. You can see they are hand colored; there are different shades of green; a page has been overlaid on top of another, and then a second page overlaid on the previous one, so the color coding varies slightly, different reds, and so on.

The next step is to make it geo-aware. A scanned page alone won't tell you where it belongs: if you're standing in the middle of Boston, you won't know whether a given map is of Boston just from the scan. Maybe it depicts a block of Boston, but the person who added the metadata didn't specify the city, just "block X, this building." Georectification takes the scanned image of a given block, here a block of Manhattan, and distorts it to make it match OpenStreetMap; the math is more complicated than just scaling and rotation. The next step would be to trace every single building on that page, which by hand takes a while. How long? About 120,000 footprints were produced in about three years by staff and volunteers. That sounds like a lot, but it covers roughly just Manhattan and Queens for one year, 1853. So there has to be a better way, and we needed to find out if one is possible.
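To make the "more complicated than scaling and rotation" point concrete, here is a minimal sketch of the flat, affine case of georeferencing, fitted from three ground control points. Real georectification tools typically use higher-order polynomial or thin-plate-spline warps; everything here is illustrative, not the actual pipeline.

```python
# Fit an affine transform mapping pixel coordinates to map coordinates
# from exactly three ground control points (GCPs), then apply it.

def fit_affine(gcps):
    """gcps: three ((px, py), (x, y)) pairs. Returns (a, b, c, d, e, f)
    such that x = a*px + b*py + c and y = d*px + e*py + f."""
    (p0, w0), (p1, w1), (p2, w2) = gcps

    def det3(m):  # determinant of a 3x3 matrix
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    A = [[p0[0], p0[1], 1], [p1[0], p1[1], 1], [p2[0], p2[1], 1]]
    D = det3(A)

    def solve(rhs):  # Cramer's rule for A * coeffs = rhs
        out = []
        for j in range(3):
            M = [row[:] for row in A]
            for i in range(3):
                M[i][j] = rhs[i]
            out.append(det3(M) / D)
        return out

    a, b, c = solve([w0[0], w1[0], w2[0]])
    d, e, f = solve([w0[1], w1[1], w2[1]])
    return a, b, c, d, e, f

def apply_affine(t, px, py):
    a, b, c, d, e, f = t
    return a * px + b * py + c, d * px + e * py + f

# Toy GCPs encoding "scale x by 2, y by 3, then shift by (10, 20)":
t = fit_affine([((0, 0), (10, 20)), ((1, 0), (12, 20)), ((0, 1), (10, 23))])
apply_affine(t, 2, 2)   # -> (14.0, 26.0)
```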
Lots of companies and people come through the New York Public Library, and every time we got an opportunity with somebody who had done GIS work, we would ask: these maps are pretty simple, do you know a way of extracting the data out of them automatically? Nobody would really get past "we'll get back to you," until we ran into this guy. You might know the one on the left, he spoke earlier today: John Resig. He was talking about his Japanese print work, and we asked, hey, we want to digitize these maps computationally; do you know who has done that? He said: I don't work with maps, but my brother works in archaeology and I've heard he's done some map work, let me put you in touch. We exchanged a few emails and he sent us back a workflow: if it were to be done computationally, it could be done like this. There was some software involved and a lot of hand meddling, but we thought: could we automate this in some way? Well, why not?

Before I go further: all the code mentioned here is in public repositories, and I'm going to publish this presentation on the web and tweet the link, so don't worry about writing down exact URLs; I'm going to go really fast. First of all, we need to define what a building is for our specific need, for this specific map; this won't generalize to every single map in the world.
So, this is an example of a snippet, and here is our definition: a building is completely enclosed by black lines; dashed lines do not count as walls; it has to be more than 20 square meters and less than 3,000 square meters in area (I will talk about why that is the case); and it has to not be the color of the paper. The process itself is in this repository, and I wrote a paper on it, which I had the opportunity to present at the ACM SIGSPATIAL conference, which is pretty cool. It is basically a Frankenstein of spaghetti code articulated by a Python script: GDAL, R, GIMP, ImageMagick, OpenCV. This is our reference map, the base image we're going to work with, so let's go through our checklist. A building is completely enclosed by black lines, and dashed lines do not count, so first of all we need to provide the best possible input image. As you can see, this is clearly handmade: there are lines missing, there's this collage happening, fading paper, wear and tear, faded colors in the sheets. These are all limitations we have to work with. There's also the georectification step itself, which distorts the image, creating new pixels and destroying others; this process is called resampling, and different resampling algorithms produce different results. In our case, cubic resampling was better; with the alternative, you see broken lines on the lower right-hand side, and we don't want that, based on our definition. Once you've done that, the first thing you do is make a binary bitmap, which basically means making the image black and white.
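That checklist can be written down as a simple predicate. This is only a sketch: the 20 and 3,000 square meter bounds come from the talk, while how "enclosed" and "paper-colored" are actually determined is left to the rest of the pipeline.

```python
# A sketch of the building definition above as a predicate.

def is_building(enclosed_by_solid_lines, area_m2, paper_colored):
    """Apply the talk's definition of a building for this map style."""
    if not enclosed_by_solid_lines:    # dashed lines don't count as walls
        return False
    if not (20 <= area_m2 <= 3000):    # filters dots, streets, whole blocks
        return False
    return not paper_colored           # empty lots are paper-colored

is_building(True, 150, False)   # -> True: a plausible footprint
is_building(True, 5, False)     # -> False: too small, a symbol rather than a building
is_building(True, 150, True)    # -> False: paper-colored, likely a yard
```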
You pick a cutoff: any pixel darker than that value becomes black, anything lighter becomes white. The image needs some additional transformations, but this is what we call a thresholded image. Then we polygonize it, which basically means running a Python program on it and producing a vector file, something like this. It's a very rugged little polygon. Polygonize takes everything that is continuously white and extracts a vector out of it, and likewise everything that is continuously black. That's where the area constraint comes in: those little circles would be polygons, the circles inside the circles are more polygons, that little thing up there is another one, and the streets themselves, as you see at the edges, are also polygons. So we need to exclude the polygons that are too big or too small to be buildings, and we need to simplify what's left. This is one example of those polygons, and we would like something resembling the shape in red. That goes through a process based on alpha shapes (a generalization of the convex hull) over a sampled point set; there are some links for that if you're interested. Basically, given a set of points, a tighter parameter value gives you an enclosing shape that fits the points closely but may break into several polygons, while a wider parameter gives you a single polygon with less detail. We want a point in between: the detail of the polygon, but only one polygon.
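Here is a toy version of the threshold and polygonize steps: binarize a grayscale grid, then flood-fill the connected ink regions and measure their areas, which is the information that lets us drop regions too small or too big to be buildings. The real pipeline does this with image and GIS tooling on full scans; this sketch only shows the idea on a tiny grid.

```python
def threshold(image, cutoff=128):
    """Pixels darker than the cutoff become ink (1), the rest paper (0)."""
    return [[1 if px < cutoff else 0 for px in row] for row in image]

def region_areas(binary):
    """Areas (pixel counts) of 4-connected ink regions, via flood fill."""
    h, w = len(binary), len(binary[0])
    seen, areas = set(), []
    for sy in range(h):
        for sx in range(w):
            if binary[sy][sx] == 1 and (sy, sx) not in seen:
                stack, area = [(sy, sx)], 0
                seen.add((sy, sx))
                while stack:
                    y, x = stack.pop()
                    area += 1
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                areas.append(area)
    return areas

page = [
    [250, 250, 250, 250, 250],
    [250,  10,  12, 250, 250],
    [250,  11,  13, 250,  20],   # one 2x2 dark block and one lone dark pixel
    [250, 250, 250, 250, 250],
]
areas = sorted(region_areas(threshold(page)))   # -> [1, 4]
```

A real polygonizer also traces each region's boundary into a vector outline; the area filter shown here is the part that discards specks and street-sized regions.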
We need a set of points first, so we extract them from this shape. Using a function in R, we can extract a hexagonal grid, a rectangular grid, or a random grid of points, and we actually try all of them; it's kind of brute force, because not every run generates a satisfying shape. Then we alpha-shape that set of points. Here is an example where we lost the L shape in the process of alpha-shaping; it's an intentionally bad example, to show you what can happen. There are other point-reduction algorithms; we didn't know about them then, we know now, and if anyone wants to collaborate, it's an open repo and we would definitely appreciate the help. Now we need to know what is a building and what is not, and our approach is to find anything that is not paper-colored. Given the resulting simplified polygon, we cookie-cut that polygon from the map, find its average color, and compare it, using a basic Euclidean distance, to a list of approved colors; if it's closest to paper, we ignore it. The end result is something like this: given this image, this resulting polygon. It's not perfect, but it's good enough for us; some polygon is better than no polygon. To give you an idea, this is a bigger part of New York City: before, after, before, after. There are some false positives and some false negatives, but overall it's a pretty decent result, very encouraging. Another one: we lost a whole block of brownstones there. So, checklist complete. We also did a little bit of computer vision to find the little circles and crosses in the buildings, which mean something; we have false positives there too, but this is very primitive work, and if somebody knows computer vision better, that would also be a huge plus.
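The color test might look like the following sketch: average the pixels inside the cookie-cut polygon, then take the nearest palette entry by plain Euclidean distance in RGB. The palette values here are invented for illustration; the real approved colors live in the repository.

```python
# Nearest-palette-color classification for a polygon's average RGB value.
# Palette RGB values are made up for this example.

PALETTE = {
    "paper":      (235, 225, 205),
    "wood":       (215, 195,  60),
    "brick":      (190,  90,  70),
    "brownstone": (140, 100,  70),
}

def classify(avg_rgb):
    """Return the material name, or None if the shape is paper-colored."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(avg_rgb, c))
    name = min(PALETTE, key=lambda k: sq_dist(PALETTE[k]))
    return None if name == "paper" else name   # paper => not a building

classify((192, 92, 72))    # -> "brick"
classify((235, 225, 205))  # -> None
```

Squared distance is used because comparing squared distances gives the same nearest neighbor as comparing true Euclidean distances, without the square root.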
In the end, this is what it looks like: you just say vectorize, give it a georeferenced image, press enter, wait 10 minutes, and you get a shapefile and a GeoJSON file. In this process we managed to produce 66,000 footprints in one day, with this one machine running linearly (I think I had two processes going), so you get an idea: this can scale. The fastest we could do on this machine was 10 minutes of processing per sheet, which is about six or seven Manhattan blocks, for an 1859 atlas of Manhattan. Some caveats: adjacency is not enforced, so buildings do not fit against each other very nicely, they're a little separated; there are false positives and negatives; and there are some overlapping cases. But it's not so bad.

Now the second act: we need to see if this is good data or not. We were hugely inspired by the work of Zooniverse; I don't know if anyone is familiar with them, it's a group based at the Adler Planetarium in Chicago and Oxford University. Scientists have lots of data that they cannot computationally process, so they go to Zooniverse and say: we don't have enough PhDs to go through all this data, but you have access to thousands of people, so show them this data and let the crowd classify the information. And Zooniverse builds things like this. This is Planet Hunters; those who are educated in astronomy will spot a pattern and say, oh, this looks like it could be a planet. Multiple people review each piece of data, and this way they found a binary system. I won't do it justice describing it, so go to planethunters.org and Zooniverse and check it out. This was a huge reference for us: basically taking the task of a scholar, a very specialized person, and cutting it down into bite-sized, fun-sized pieces
to then have many eyes go through the same piece of data and let a consensus surface on its own, with some mathematical criteria. In our case, that means validating these footprints: knowing whether a polygon is correct, and how correct or incorrect it is. But the question for us was: are people willing to spend time checking building footprints? These are not planets, you know. So this was something we wanted to test, and we made the Building Inspector. This is the URL; it launched a few months ago, and the code is here. This is basically what it looks like; I'm showing you the condensed version, how it looks on a smartphone. You are given a polygon overlaid on the map itself, and you are asked to tell us if it's correct, incorrect, or almost correct and just in need of some fixing. You have a count of all the polygons you've checked, and you can tweet that number from there. Another thing people can do is view their progress: these are all the buildings I've inspected, let's check them out. You're shown all the little buildings; the greens are yeses, the reds are noes, the yellows are fixes. People can also see what the consensus is, looking at all the inspections for every building in the dataset. This is the consensus view: three or more people have looked at these polygons, and 75 percent of them have agreed on a yes, a no, or a fix, so seeing lots of green is good. About a month later, we saw things like this: overall, about 420,000 flags had been produced (a flag being a yes, a no, or a fix by one person) for more than 70,000 unique polygons. And the important part: 84 percent are yes in consensus and 7 percent are fixes, which for us is pretty
good results. So basically (the blue ones here are mine), people are willing after all to contribute, and this is all free; there's no other incentive to do it. Now, our final act. Given that people are willing to go through this task, there is a lot more we want to extract from these maps: street names, all of it. So, informed by the acceptance of the Building Inspector, we sat down with the map curator, the geospatial librarian, defined a way to divide and conquer these tasks, and selected three of them: what is the most informative, the most precious information we would want to get out of this dataset? We decided it was the address; the material the building is made of, which is encoded in the color of the building; and the footprint itself being fixed, for the ones marked as needing a fix. The flow goes something like this: a building is checked; if it's a yes, it gets sent to an address task or a color task; if it's a fix, it gets sent to a fix task, whose output is then sent back through a yes check and on to either the address or the color task; if it's marked as a no, it gets ignored. So we needed to design interfaces. One thing we think was part of the success of the first version was its simplicity: just yes, no, fix; yes, no, fix. We needed the same for these other three tasks: how to make an easy addresser, an easy color selector, and an easy fixer. These are videos of the new tasks. In the fixer, you get presented a polygon with some control points, and you can delete them, add new control points, move them around, and save. This is a case of a polygon that encloses three buildings, so you would select the multiple-polygon option and go and move
it around. I should have recorded this at a faster speed, but I wanted to show you one thing: you add it, you get shown the polygon again, and you can add as many polygons as needed; some polygons enclose five or six buildings. This is the addresser: you see an address next to a building, click on it, type it, next. Many, many buildings do not have addresses, and many buildings have multiple addresses; we want them all, so you click on every address you see, and go on. Please try it, and contribute. There's one more task, classifying color, which unfortunately is not at all colorblind-friendly, and these colors are faded, but anyway: you click on the color you see in the building enclosed by the polygon; if it has multiple colors, you select "multiple" and pick as many colors as the building has, and go on. The textures of the buttons are based on the map itself, so you can drag one over and compare; you can see there are some problems distinguishing the blue from the green. So anyway, classify color. The resulting data is available via an API, and it looks something like this: it's just GeoJSON that you can copy and paste and do whatever you want with, hopefully lots of interesting research, like producing cool t-shirts. And this is not the end. The idea, as John mentioned earlier, would be to extend these tasks so that people can, for example, do them on the subway, and there are many more data points we need to extract through the Building Inspector. But this is just a continuation of this story. Gracias.
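To give an idea of what working with that GeoJSON output looks like, here is a sketch that parses a FeatureCollection and keeps only consensus-approved footprints. The payload below is invented for illustration and does not reflect the exact NYPL API schema.

```python
import json

# A hand-written stand-in for an API response: one consensus-checked footprint.
payload = """{
  "type": "FeatureCollection",
  "features": [{
    "type": "Feature",
    "properties": {"consensus": "yes", "year": 1857},
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-74.0, 40.7], [-74.0, 40.701],
                                  [-73.999, 40.701], [-74.0, 40.7]]]}
  }]
}"""

data = json.loads(payload)

# Keep the exterior ring of every footprint whose crowd consensus was "yes".
footprints = [f["geometry"]["coordinates"][0]
              for f in data["features"]
              if f["properties"]["consensus"] == "yes"]
```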