 Greetings, everybody. Thanks for coming. Can you hear me OK? This is probably overkill with the mic. So thanks for having me. My name's Alex Reichev. I work as a civic hacker. I guess that's a trendy name for it. MRCagney, which is a walking, cycling, and transit consultancy based here in Auckland, New Zealand. I also have a branch in Brisbane, Australia. And there I have the privilege of using Python on a daily basis for analyzing and visualizing data, mostly about cities. And one of the tools I often use is GeoPendas. Comes up when we manipulate spatial data. We deal with spatial data a lot. And GeoPendas is my go-to tool for that. And I'd like to introduce you to it through an example. We'll do a little live coding, well, running, live executing of things through a Jupyter notebook. I think an example is a good way to get a taste for GeoPendas or any software. GeoPendas is at version 0.3.0 now. This is their website. GeoPendas is an open source Python library for manipulating spatial data without a spatial database. So if you have smallish data and you don't need a database or you want to do your stuff in memory, GeoPendas is good for that. Anybody played with Pandas before? Pandas, the data table library, great. Anybody played with Shapely before? That's a geometry library, mostly 2D geometry. Put those two together and you get GeoPendas. So basically, you're dealing with tabular data that has an extra geometry column. So you can deal points, lines, et cetera. So it combines GeoPendas, aims to combine the capabilities of Pandas in Shapely, and mostly you deal Geo data frames as they're called to interface between multiple geometries in Shapely. It's about three years old, this library, and I'm going to introduce you. It's pretty mature. There are some performance enhancements that are in the works. It is a little bit slow on some geometric operations. So as an introduction, I'd like to work through a little example, something relevant to Auckland's here. We're going to use some open data to answer the following question. Oh, I should say, all this is on GitHub. I just put it on my repo. So you can work through this example if you want and extend it. You all see that, all right? That's my question. What are Auckland's crashiest roads? So I'd like to use some open data in GeoPendas to answer this question. The question's kind of vague, which is good, because we get to make our assumptions as we go. So I'm going to start, so I'd endeavor to answer this question here. And here's how I propose to do it. First, I'm just going to import a bunch of libraries, one of them being GeoPendas, like that. And I'm going to download some open data. One, New Zealand crash data. I'm going to need that from NZTA, New Zealand Transport Authority. Here's the website I'm getting it from, Disaggregated Crash Data. It's got a few years worth of it. I'll show you what that looks like in a moment. And also, I'm going to use some Auckland Road Geo data from OpenStreetMap. I'm going to get that from MapsM. They have a bunch of Metro extracts. You can get it for lots of places, but MapsM has a nice, convenient extract. They do a lot of cities around the world in Auckland. It's one of them. And I've got a chunk of this orange chunk here of Geo data that we'll get, some pink one. I'm going to get the ipposim.io.json file. So I actually did that ahead of time. You can see my data directory there. I downloaded the data. And I also cleaned it up a little bit, which I'll actually execute in this next cell here. So crash data and road data. Put it together, find crashy roads. That's the idea. So I'm just going to prepare the data a little bit, because the crash data is a lot. I just want to look at the last, sorry, the latest five years. So I'll just do that in these two cells. Technically, I didn't need to execute this cell here, because I already did this. But anyway, we'll get the effect. There's some crash metadata, too, to describe the variables. And it's got some weird Latin 1 encoding, so I got rid of that there, too. Windows. OK, let's explore the data here. So I'm going to load first bit of GeoPandas here. The road data, like I said, is a GeoJson file. That's a standard, one of the most popular file types for geographic data. I'll use the read file command from GeoPandas to do that. So I just give it a path. I'm using path objects here from the Path of Label Library. And GeoPandas doesn't like path objects. They like strings, so convert to a string there. So I read that. And this head command, just like pandas, I'm going to print the first five rows of the data. And there you have the data. There's an OSMID for the road, a type of road, a name, tunnel bridge, other stuff, class. And then this extra geometry column. This is the GeoPandas magic. This is a shapely object, a line string here. Every road has some geometry associated to it. Notice the coordinates here. These are longitude, latitude. That's the standard for GeoJson. Every GeoJson file should have that. GeoJson used to allow other coordinate reference systems, but not anymore. The default is now WGS84, the stuff that comes from GPS satellites. But I don't want to use that coordinate reference system for this work, because for this work, I want to deal with meters farther down in our analysis. And longitude and latitude don't help you deal with meters sensibly. So I'm going to change to a different coordinate reference system here. Anybody got a suggestion? NZTM, the winner. NZTM, a New Zealand transverse Mercator projection. So the Earth is roughly an ellipsoid. And we're going to just take a patch of it around Auckland and then flatten it out. There's infinite ways to do that. They all introduce distortion. But for our purposes, the least distorted method to do that is to use the NZTM projection. So I'll convert those longitudes and latitudes into meters on the ground, easting and northern. And that's super easy with GeoPanas. We'll use this two CRS command, coordinate reference system, CRS. They're defined in dictionaries. So if you just, by default, if you read it in a file that will put it in WGS84, it won't sense a GeoJSON file that is, which is what we read. Here's my definition of NZTM by its EPSG code, which you can look up. And yes. Yeah, so this is the issue. This is cartography 101. And it's an issue that you definitely have to get through with geospatial data. So we need to convert GPS data or WGS84 reference system of where you're describing a point on Earth by two angles, longitude and latitude, basically from roughly the center of the Earth. And so those are angular measurements. And I want to convert them. I want to convert that to a map on the ground. So I want to convert that into a 2D flat plane, X and Y plane, where the units are meters instead of angular measurements. This is important when you do geospatial analysis, especially when you have to deal with distances. Always convert to the relevant projection coordinate reference system for your analysis and make sure all your data is in that one system. For New Zealand, NZTM is the preferred coordinate reference system that you should project to. I'll just say that much about projections. There's a whole bunch of math behind it of just trying to flatten out the sphere. It's impossible to do without distortion. You just imagine taking orange and peeling it and flattening it out. You're going to distort it somehow. Either angles or distances. But NZTM is the best we have around here. And that's the coordinate reference system I'll use so that whenever the units you'll see will be in meters. So I'm going to execute that command to CRS. And I just put a print out here. The current CRS was WJ-94. The new one we converted to is NZTM. Now look at the coordinates. Much different. Those are meters now. Meters with respect to some origin. I don't know where it is actually. Should have looked that up. I'm not sure where they put the origin of NZTM where in New Zealand they put it. But we are. This is like 1.7 million meters one way. It's easting and northern. Easting and then northern and a few million the other way. OK. Geopand also has a simple interface into matplotlib. So I'm going to plot the roads here using the plot command. So again, roads is my variable I'm storing as it. That's my geodataframe name, roads. I plot it out. And I colored it by column type, which in this case is like category of road, primary, secondary, et cetera. So there's a funky picture of the roads we're dealing with here. I should say they are generalized roads. That is, the resolution is, I wrote it down up here. Was it 20 meters? 50 meter tolerance, the resolution. So these roads will not be bang on where they should be. That's something for further exploration, put at the end. OK. We have our roads data as a geodataframe in NZTM coordinates. Great. Next, let's reload the crash data. This is, if I load it up, I'm just using pandas here. It's a CSV file. I'm reading it in with the pandas command read CSV. Here's what the first bit of it looks like. I'm taking the head, I'm giving the first five roads, but I'm transposing it so you can see all the columns here. This is the data, crash year, crash financial year, crash severity, fatality count, et cetera, et cetera, et cetera. Lots of things. Was it dark? Yes, dark. The weather, lots of things in this crash data. In particular, you see it here. We have Easting and Northern. That indicates that we're dealing with NZTM coordinates in here. But this is not a geodataframe yet. This is just a dataframe with pandas. So I'm going to convert this. Now, sorry, I'm going to do a little filtering. I only want to talk about Auckland here. This is for all of New Zealand's crash data. So let's filter some of that out here. Just use some pandas commands. I'm just going to take the region and just the rows that where the region contains a word Auckland in it. And also, there's some funny things where you have negative Northern or negative Easting, which is clearly like a data problem, because those things should be positive in this setup. So I'll just filter this out. And now we just have the Auckland data frame. Oh, by the way, that crash-sev thing from the metadata crash-sev is the severity of a crash. Possible values are f for fatal, f for serious, m for minor, n for non-injury. It's not going to be relevant in this case, but it just means that X and Y coordinates and X is Easting. How far East and West you are, how far North and South you are. So now we're on some flat map. It's like you would see on a piece of paper. The important thing for us is that all the units are measured in meters here. And this is good. Feel free to interrupt me if you have any questions along the way. OK, so I want to convert this data frame into a geodata frame. So I want to do some geospatial calculations here. So I can do that by creating my own geometry column. Remember this data, if you saw here, I pointed out, it has a Northing and Easting column already. It has the geospatial data in there. It's just not formatted as we like. So these crashes are points. This is a certain X and Y chord. So I'm going to turn the geometry on. I'm going to create a geometry column. I'm going to do this by using, I imported the shapely.geometry module as SG. So I'm going to create a point out of those two columns, the Easting, X, Y, and Northing. Sorry, Easting, X, and Northing, Y coordinates. I'll create a point out of them. I'll create a whole list of points for all of them. And I'm using the geopandas command. Create a geodata frame from F, which was a data frame. The coordinate reference system is NZTM, which is what we want, which is what we put the roads in. And the geometry is the geometry column I just created. I do that. And we're looking at transpose here again, just so we can see all the columns. And look, we have an extra column now. That's the geometry column from geopandas. It's a point. The roads were line strings. These crashes are point geometries, both in the same coordinate reference system. Again, if you're ever doing geospatial analysis, make sure your coordinate reference systems are the same for all your data sets or else it'll be meaningless to what you do. Well, geopandas will give you a warning for some things that they're not in the same coordinate reference system. So that's good. That's some check on what you're doing. But the geopandas just thinks it's numbers. So if you see a point in the coordinates are 174 comma negative 37 versus 5 million and 2 million, then there's not going to be any intersections between the two. You're just going to get nonsense answers, probably. Where there should be an intersection, there won't be an intersection. There should be overlaps, there won't be. So just a warning for that, the first rule of spatial analysis. Just made that up, but it's important. And let's just double-check that by plotting that we're sensible here. I've plot the crashes on the roads. Again, I'm just using the geopandas plot command, which is really mapplotlib running in the background here. Crash is in red for blood. Now, I shouldn't joke about that. But here we see that, yes, we're in the same coordinate reference system. Makes sense. Pretty crashy around the isthmus. Yeah, what's up with that? So these are not necessarily what are data issues here. So we could be missing some roads in our data set. Some of our roads are generalized. Oh, I'm told not to point like that. Some of the, I mean, this is probably not really a straight line, because some of them are, the resolution is low 50 meters. I haven't dug into it, but I'm guessing these are just small roads that aren't in our data set over here. Always good to do some visual checks on your data, if you can. OK, so let's address our question here. The question was, what are the crashiest roads in Auckland? What do you mean by that? This is my proposal. Tell me if you think it's reasonable. I'm going to look for every crash, sorry, for every road, technically every way in OSM data, I'm going to get assigned to it the crashes within five meters of that road. Because as you see, some of the dots are not quite on the road. They're really close to the road. But if it's a footpath of a pedestrian, got injured or something, there's some fuzz in there. So for every way, I'm going to take all the crashes within five meters of it and assign it to that way, that road. OK, I should say, there's a difference between ways and roads in open street map data. Going back here to the open street map data, there's an OSMID, and then there's a name. You can't see it, but there's a lot of repeated names. Like Simon Street, for instance, has several sections. Those are the ways, and they have different OSMIDs. So in my first cut of this, I'm going to look at the sections of the road, not what we would call a road, but the ways, the snippets, the OSMIDs. And that's what I'm proposing here as our first go at answering this question. Take all for each way. Take all. Look at all the crashes within five meters of that way and assign it to that road segment. Then that'll give you a crash count for each road segment. That's the first step. We at least want to know how many crashes are on each segment. So I'll do that. And in doing so, introduce this concept of a spatial join. This is where the magic happens. Oh, sorry. Yes, I'll do this. Spatial join, gpds join. So if you played with pandas or tabular data, you know you can merge two tables together on a common key or a common key. This is the geospatial equivalent, spatial join. Instead, we're going to merge two tables instead on a common key on something geometrically meaningful. And the default is intersection. So I take one table, which is our roads table. And the other table crashes. And I'm going to intersect them. That is, I'm going to find all the roads that intersect with the dots. All the line strings that intersect with the dots. Well, not quite. Because I said I want to take the dots that are within five meters of the road, not just the ones that are directly on the road. So I'm going to perform another geometric operation first. I'm going to take the crashes data. And I'm going to buffer it each point by five meters. So now each point is blown up into a blob. And now if a blob intersects a road, that means that that point is within five meters of the road. So that's what I'm doing in this cell. Buffer the crashes, intersect them with the, spatially join them with the roads. Spatial join is usually an expensive operation. And this is where geopenas can benefit from some speed ups. For instance, if you use QGIS, QGIS would do this faster. Well, I don't know about in this small example, but definitely when it gets bigger, QGIS can do it faster. What's going on under the hood? There's a library called Archery, which creates a spatial index to do these fast queries. So you're not going through a for loop and saying, here's a road for every crash in the world. In my data set, is this five, five meters? Does it intersect this one? Does it intersect that one? No, you can, if you use a spatial index, you can narrow your search down and just pick the ones that are nearby, in a sense, to investigate for actual intersection. So I did that. And what results is the roads that intersect the crashes. And this is the, I just, I notice I took down the, I chopped down the crash table to just the geometry and the severity of the crash. So this is what remains over here. And the index right refers to the index of the crash table, just in case you want to go back to the crash table and see what's up. So you'll have roads with multiple intersections. Let's see right here. Nielsen street has an OSMID with a crash and here it appears again with another crash, a different crash. Yes, so this is, now you can raise a methodological objection. Like, wait a second, what if there's a crash that's needed by an intersection? You know, within five meters of every road, now you're counting it, you're double, you're counting it for every road. Yes, you can argue with that. That's good, maybe that's bad. That's a methodological thing. But yes, one crash can count to multiple roads now. So we have this big geodata frame of roads with their crashes on them. Now I want to assign a crash score to each road so I can compare the roads. I'm going to do something simple. I am going to use the number of crashes per meter of road. I could just use the number of crashes. That's a fine metric. But I want to illustrate another geopandas feature, so I'm going to say, let's use the number of crashes per meter. Okay, somehow normalize for the length of the road because the road is really long. You know, you shouldn't be comparing a super long road with a super short one, which is a number of crashes. Let's normalize somehow. So, first, I'm going to aggregate the counts. So, I'm going to use this data frame up here. It's got the roads repeated. One, for every hit, you got the road coming up again and again in the roads. I want to group by the road and then just sum up the counts, the crash counts. So now I'll have a table of roads and their crash counts. And that's what I'm doing here. I'm just doing a group by the OSMID. This is a pandas operation. And I'm going to sum the crashes, take the first name, and take the first geometry. And, hey, this is something that I don't like about you, pandas. Sometimes when you perform group by operations, you lose the geometry table, not the table. Sorry, you lose the coordinate reference system. I think this is a feature that Geopandas needs to work out. So let me execute this, see what happens here. So, this is my little clue here to put the coordinate reference system back into the Geo data frame. The Geo data frame, which was F, became a data frame after my group by operation by pandas. So I lost the Geo nature of it. So I was like, ah, that sucks, let's put it back in. So I had to recast as a Geo data frame. And this part here, I'm going to add a new column to G, the length of the road. And I perform Geopandas as all the shapely geometry commands that you can do. So anything in the geometry column you can perform a shapely operation on it. One of them is length, which makes sense for a line string, which is a road. So that's what I'm doing here, I'm saying, compute the length of the road. Now set a crash score. The number of crashes divided by the length. And now sort by crash score, highest to lowest. And I'll save it to a file for posterity. I'll get back to that saving file for a moment, but let's look at the table. Massey Road, crashiest of them all, according to our metric here. So we see is OSMID, the name, the number of crashes, 44. This is in the last five years, last five years. The geometry, the length of the road, pretty short, five meters or so. And the crash score, which is quite high above these other ones. I'll plot that in a moment. Let me just go back to the saving a file, another Geopandas command. I wish they, if you want to save to a shape file, there's just, there's just, you just write to something.shape and it works. But with GeoJSON, for some reason, you have to, the save file command doesn't work nicely with GeoJSON. Another weirdness of Geopandas. So I have to open a file and write to JSON. So what you see over here is, one, like I said, GeoJSON coordinates must be in WGS84 according to the spec. So I'm converting back to the coordinate reference system, longitudes and latitudes. And here, this is the Geopandas command too, JSON. I'm dumping, I'm converting the table into JSON. It'll be GeoJSON, so it'll have the point coordinates in it as well. And here, I'm just writing to a file. Let's save that for later. And here are our roads. The crashiest road is the mass road, interesting. Let's see where that is. So the tables are nice, but humans are visual creatures. Let's draw this a bit. I wanna color code the roads by the crash score. I'm gonna use this library called spectra. It just makes coordinating, manipulating color scales a little easier, kind of like D3, I don't know if you use JavaScript, D3 color scales are brilliant. This is the closest thing I could find in Python. Spectra library. So I'm gonna choose a spectral color theme, which is some colors that I got from Color Brewer. And I'm gonna make some cutoffs based on quantiles. So I'm not getting a color code by the score per se, but the quantile of the score, where it fits in the distribution of scores. And I'll make some marks, zero to 25th percent quantile, 50th, 75th, 90th, and 100th. So if it's really red, it's really bad. And I'm gonna do this in LCH color space as an aside. That's what you should do when you make maps. Use LCH color space, because apparently that's the best for if you wanna create a gradient that humans perceive as most different across a color space, you should use LCH rather than RGB or something else. That's something from, what do you call that? Psycho something visualization. There's a whole science behind that color theory. Okay. And here are the quantiles, by the way, of the crash scores. Or I mean a summary of the crash scores. You can see our crash scores are mostly pretty low, except there's this crazy one. I mean, there's some crazy ones, Massey Road and others up there. Let's put it on a map. I'm going to use another library for this. You saw the plot with GeoPenna's mapplotlib interface up there, the plots that I've been using so far. That's pretty good for a quick look, but I want to pan and zoom into the roads. So I'm gonna use Folium for that, which is a Python library that wraps the leaflet JavaScript library and plot that out. I'm gonna style based on the colors that I'd set up there. And I'm zooming the center of the map is the crashiest road. I did that up here. It's at the longitude or the center. And Massey Road, what we're dealing with is this little section over here. That's the crashiest section of Auckland. What's going on there? Has anybody been that way? Is there anything particularly strange about that intersection? Uh-huh. So that's sensible. That could be dangerous there. Let's zoom out and look at the red bits. Ooh, lots of red bits around the CBD as well. So again, we're assigning these crash scores to the little sections of road. You might want to do something else and assign it to the entire road instead of the sections of road so that you can talk about, not just section of it, but the name of the road, the whole road itself. So that's our crash map, the crashiest road seems to be Massey. And what was the second runner up? The runner up was Moai Oro Street. And followed by Massey Road again. And then Simon Street is probably, you might have guessed, around there. Oh, that's the length in meters. The length of the section of road we're dealing with. So Massey Road, yes. So this is not Massey Road, but as you know it, Massey Road is the name of the entire road, but the OSMID, the section of the road we're looking at is only five meters. So it's five meters and Massey Road seems to be the crashiest according to our metric here. But look, Massey Road appears again, but that's a different section over at Massey Road, a 7.8 meter section of it. Yes, exactly. So you might want to refine our analysis here. So I want to do a starter to introduce our basic Geopandas concepts here, but here's some ideas for further exploration. You can use better resolution roads, because if you look, some of these roads, they're not following the base map. The roads are great resolution, 50 meters. You could do the analysis instead on the road name rather than the section of road, the road ID. You could, how could you do that? You could dissolve the roads into one. So you could group by, there's a dissolve command in Geopandas to put all the roads together. Yes, that's you, you gotta be careful of that. So if you dissolve, but dissolve, yes, actually that will be a problem. So if you have multiple queen streets, they'll be dissolved into the same multi-line string. And so you'll be getting scores from both sections, which you don't want. So you gotta be more careful. And another thing we could do, we could just choose a totally different metric. You should consider severity, for instance. It might be an objection. Look, come on, a death, a fatality is worse than a non-injury, you should be factoring that into your crashy score meter. Number of crashes? Number of crashes? So yes, yes. Yeah, so we counted, we counted, we assigned all the crashes to the sections of road, and then we normalized, we divided the number of crashes by the length of the road. And that's what, that was our crash score. Yeah. Ah, yeah, see, you can take advantage of that. Mm-hmm, that would help with that. And you could also say, forget crashiest roads. Why don't you just look at the cluster of crashes instead and make a heat map instead and talk about crashiest areas instead of crashiest roads. Yeah, do you have a question? So I put that in ideas for further exploration. I want you, if you want, and if you feel inspired, you can take this notebook and extend it in the ways we're talking about, yes? So one of the suggested metric would be the lethality of crashes, how many lethal crashes out of total crashes? Yep, could do that too. So forget the non-injuries or the non-serious ones, just count the faculties. Not forget them, but the user myth is that that's the baseline number of crashes, the total crashes, right? Because some places would be more dangerous for crashes than others. So the crash is more likely to be, like coming off a motorway onto an intersection, you're more likely to be lethal if you come into a tensed section than if you're in a low speed area. So you might have. Right, so you're saying weight by severity somehow. Yeah, so the lethality of the crashes occur there. That's a good point. That raises also some sticky questions, like how much worse is a fatality from a serious injury? Depends. Number of fatalities divided by number of crashes. So how would you do this to convert this to a heat map? Because then you're not just dealing with, like when you start overlaying a bunch of crashes all in the same thing, how would you weight that spatially so that you can see that as a hot area? Yes, that's a good question. So geopenas does not have a heat map function for you, but there are other libraries that do do that. And I'm trying to think of which one does. The sci-pi have something like that. Because when I made heat bats, I've done it with Java script. I mean, I don't use heat maps that much. So geopenas, that's a good point though to bring up that geopenas mostly just covers the core of spatial operations, like reading file IO and reading files and stuff, converting coordinate reference systems, buffering points or computing lengths and areas, spatial joints, overlays. Like if I overlay two things, I wanna find out like the unions or the intersections of those overlays. So basic GIS operations. It doesn't include anything fancy like heat maps. That's the stuff you either have to do yourself or use another library to do. I think that's intentional too. That was their aim to start out just with a basic core of operations. But the heat map, if anybody does do a heat map, I'd say issue a merger request. I'll put that in, because I'm curious to see what that looks like. I wanted to do road stuff, so I wanted to bring in the spatial join operation. Just illustrate that. Any other questions? Yes? Time, like, time of day? Mm-hmm. Yeah, like a time series view of the crashes. That would also be good, yep. So looking at the time of the crash, so I'll just repeat for the record, the crashiest time, but also if traffic's getting worse and people are taking longer to get home, does that affect the timing of crashes and the lethality of the crash and the frequency of the crash? Ah, right, how does the background traffic fit into that? Yeah, that's a good question. Tough data to get though, not open data, and there's no, or open, I should say open, as an aside, open traffic data is a new project. There's not a lot of it yet, but it's coming. Most of it's proprietary, like Google Maps there. You know, the red, the yellow, the green, they don't tell you what red is. It just takes a long time, how long? But anyway, that's in the side. Thanks, though. Any other questions you have about geopandas? Cool, well, I hope that was a meaningful introduction and you can appreciate some of the pros and cons of using geopandas in your project, if you have one. Thanks again for your time.