Welcome to Exploring OpenStreetMap Data in PostGIS: Tools and Queries. This session will introduce the tools, concepts, and basic queries needed for working with spatial data stored in PostGIS. My name is Lindsay Hooper and I'm one of the conference organizers, as well as the moderator for the Postgres Conference webinars and digital trainings. I'm here with Ryan Lambert, the owner of Rustproof Labs. Ryan has been working with GIS since 2011 and got his start when, on a quest to update a road map, he started using PostGIS, Postgres, and OpenStreetMap. He has been a contributor to the OpenStreetMap project since 2015, he's spoken on PostGIS, and he has run it at both small and large scales. He's currently working on a book on how to use PostGIS and OpenStreetMap together. So with that, I'm going to hand it off to Ryan. Ryan, take it away. Thank you for the introduction, and thank you to PostgresConf for hosting this webinar series, and thank you to everyone who has joined us live for the session today. I am Ryan Lambert, and I'm looking forward to sharing what I know about OpenStreetMap and PostGIS. This is part three of a six-part series, so we'll be about halfway through after today. As each webinar is completed, I'm processing the video and getting it published, so anyone who didn't make it today will be able to watch the recording later. The first session was a high-level overview of the technology and the data we're working with, to set the stage. Last time we spent the whole session on getting the OpenStreetMap data loaded into PostGIS and ready to use. That's where we left off in the series, and today it's time to get our hands dirty. I have a handful of slides, but most of today is going to be a live demo of code and some pretty pictures. We're also going to talk a little bit about data structures.
Part of the challenge of working with OpenStreetMap is dealing with the data structures and getting them ready to use. We're going to go through all of this today and pick up from here in the next session of the series. The two main tools we're going to talk about today are DBeaver and QGIS. DBeaver is a cross-platform, open-source SQL client. It is also a bit of a GUI: if you're used to working in psql, DBeaver will feel like a GUI, but if you're more of a GUI person, you'll still see a lot of code, so it might not feel like as much of a GUI as you're used to. QGIS, on the other hand, is full-blown desktop GUI software developed for working with GIS and performing GIS analysis and visualizations. So those are the two main tools we'll be talking about today, starting with DBeaver. One of the big benefits DBeaver has is a built-in spatial viewer. The spatial viewer is what allows us to view GIS data on a map instead of just the raw bits and bytes that make up the data itself. This is a good tool for when you're developing your queries, performance-tuning slower queries, and figuring out what kind of analysis you want to do. It's a really great environment for writing a lot of SQL, iterating through your queries, and getting to a finished product. Some downsides of using a tool like DBeaver for GIS: we're really limited on GIS output, in particular how well we can style the data and export it. And if you're not already fluent in SQL, that might also seem like a downside; you've got to learn some SQL to get in there and get your hands dirty with DBeaver. We're going to jump into our code examples now. I have DBeaver loaded, I have a script file set up that will take us through some examples, and I've already connected to a database.
This database is on a local server, a local virtual machine with four CPUs, eight gigabytes of RAM, and SSD drives. I'm printing the versions of Postgres and PostGIS here, not because any of my examples use anything specific to these versions, but just so you're aware of what I'm using at this time: Postgres 12 and PostGIS 3. In a later episode of this webinar series, we will go into more detail about what's new in Postgres 12 and PostGIS 3 that opens up new possibilities with the technology. To start, I always like to get an idea of the data I'm working with. You'll see me referencing a few queries that use these dd.* views. These are just views I've created that wrap the underlying system tables in Postgres and let me query the information I need. If you were working in psql, you'd often use \dt to list your tables and get that introspection. Being in this tool, I like to still be able to query my data directly, but these are just basic wrappers around metadata. The first query looks at the tables loaded by osm2pgsql. This is where we left off at the end of the second session: we had the data loaded, and we have our three main tables. The fourth table down here is the roads table; there's data in there, but I personally do not use it, so it will be ignored for the rest of this talk. It's always good to get an idea of how many rows you're looking at and how big the data is in your database. In the case of our lines table, we have 1.2 million rows, plus 820,000 polygons and 655,000 points. That's a decent amount of data, and it represents the OpenStreetMap data for Colorado. One of the next things to look at is the columns within your tables. I could look at any of the three main tables here, and the result would be essentially the same.
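The dd.* view definitions weren't shown on screen, but a rough equivalent can be built against the Postgres statistics views. This is a sketch of what such a helper probably wraps, not the speaker's actual view:

```sql
-- Approximate row counts and on-disk sizes for the osm2pgsql tables.
-- A sketch; the real dd.* view definitions were not shown in the session.
SELECT relname AS table_name,
       n_live_tup AS approx_rows,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname LIKE 'planet\_osm\_%'
ORDER BY pg_total_relation_size(relid) DESC;
```

Note that n_live_tup is an estimate maintained by the statistics collector; it is cheap to read, unlike a COUNT(*) over a million-row table.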
We have our columns and their data types returned from this query, and one important thing to note is that nearly all the data types are text. Each row in the result represents a column in the table along with its data type, and we can see we have a whole lot of text data. Then at the end we have our way column, which is a geometry; we have the key-value paired data we've talked about so far in an hstore column called tags; and there are a couple of other calculated fields we won't be using today. There are 70 columns in this table, and the line and point tables also have 70 columns. This is one of the challenges we have to work with when dealing with OpenStreetMap data. Now, to get to some actual spatial data, this query selects from our point table, planet_osm_point, adding a filter for highway IS NOT NULL. So we're selecting points that are in some way related to being a highway. If we look at the attribute data at the bottom, we have some turning circles right at the top, which could also be called roundabouts, and we have bus stops. I've included the tags column, and I can expand it to get an idea of some of the data in that hstore column with its key-value pairs. So the attribute data is all well and good, and we also have some point data over here. If you were querying this in psql, you wouldn't see POINT, parentheses, and then some numbers; you'd see a whole lot of binary nonsense. DBeaver automatically transforms your geometries into the well-known text representation, which makes it a little more readable than the default binary, but it still isn't very usable from a human-consumption standpoint. Luckily for us, and this is one of the great reasons I use DBeaver, we have a spatial viewer: I can click on any one of these points, and DBeaver will visualize the data on a map.
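The two queries described here might look something like this; planet_osm_point, highway, tags, and way are all standard osm2pgsql output, though the exact column list shown in the demo is an assumption:

```sql
-- List the ~70 columns of the point table. Nearly all are text,
-- except the hstore tags column and the geometry column named way.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'planet_osm_point'
ORDER BY ordinal_position;

-- Points tagged as some kind of highway (turning circles, bus stops, ...).
SELECT osm_id, highway, tags, way
FROM planet_osm_point
WHERE highway IS NOT NULL;
```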
And of course, it's going to run a little slow now because I'm doing a live demo. I'm going to switch this back to the default of Mapbox. There we go. When I select a point from the table, the map updates and shows me the point I've selected. If I control-click, I can select multiple points, and the map will update as I go and show me the variety of points I have selected. So if I knew what each ID was, this would let me see where any given row is on the map, which is very helpful when working with spatial data. Another thing I can do is click on the column header of the way column, and it will select all of the points that have been returned in the current data set. There was a question about whether this spatial viewer is an extra feature: it has been part of DBeaver for roughly a year now, and pgAdmin 4 also has a spatial viewer that has been in that project for about a year and a half. So if you download the latest version of DBeaver, you will have this feature included. With the whole column selected, it shows me all the points I've selected. DBeaver by default limits the grid to just 200 rows. I'm going to zoom in down here a little bit, and I'm going to quickly click this refresh to calculate the total row count: there are 145,000 rows. Because I'm doing a live demo, I am not going to try to load all 145,000 rows into this view right now, but they are there, and this view is limited to what has been queried into the grid. I'm going to close that panel for now, and we're going to go down to our next query. Here I'm switching to the line table: instead of pulling highways from the point table, I'm pulling highways from the line table. If I run this, we now see our way column starting with LINESTRING instead of POINT, indicating that we have line data.
There's another way to view spatial data in DBeaver, other than the fly-out panel and selecting rows that way: on the left side of the grid, we have the text option, but we also have a spatial option. If I click the spatial option, and let me zoom back out, we can see all of the rows by default. This gets rid of the ability to select one or multiple rows at a time; it just shows all the rows that have been loaded by the query results. The advantage of this is, one, there's a nice little fly-out here, so I can expand it and make it full screen, which is helpful. But I can also click on a line, or any object it has loaded, and it will give me more information from the database. By clicking on this way, it tells me the only two columns worth showing here: the ID and the data in the highway column. And I can click on another one and see what's different there: this one is highway=track, meaning it's more of a dirt track or path than a road you're going to drive on. So this is a really neat feature of DBeaver, having this spatial viewer. Before these features existed, about a year ago, we didn't have this in the GUI database tool; we had to jump over to the more heavy-handed QGIS or other desktop tools to get that visualization. One thing worth noting here: we have 200 rows that might seem random, but they're not, because what we're getting is the result of the clustering that osm2pgsql does during the load process. It clusters the data based on where it is on the map, so that when you're pulling a small area for a tile viewer, which is the standard use of this dataset for a Carto tile server, all of your data is logically grouped together on disk.
And it just so happens that the clustering starts in the southwest corner of Colorado and goes from there. The downside of getting 200 effectively random rows is that I'm not very interested in that little corner of the state; I don't know much about it, and there's really not much down there as far as mapping goes. That brings us to our next query, where we add in a spatial join. We're still pulling from the line table, still filtering for the line's highway IS NOT NULL, but we've added a join. For those of you familiar with SQL joins, this looks pretty standard, except that we have this function ST_Contains. ST_Contains is a PostGIS function that lets us operate on the geometry data and limit down our results: in this case, to Golden, Colorado. I want to find where the line is inside, or contained by, the polygon for Golden. If I run this query now, it flips back to the grid, and I'll point out that I have included both geometry columns: the polygon for Golden and the road line. The unfortunate side effect is that we grab the Golden polygon for every single road we select. I've done this because, on the spatial viewer with two geometry columns, it will visualize both of them. So if you're selecting geometries and you're not quite sure where they are, you can include the polygon, and that lets me see the boundary I've used for my filter to grab those lines. It's a good helper to answer: what polygon am I actually selecting within? When you don't need that visual cue, it's recommended to take the polygon out, because polygon data gets big: every little wiggle in the boundary is a node, and each node takes up data.
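A sketch of the spatial join described here; the exact filter used to pick out the Golden polygon was not shown on screen, so the name and boundary conditions are assumptions:

```sql
-- Roads contained within the city boundary of Golden, Colorado.
-- The b.name / b.boundary filter is an assumed way to select the
-- city polygon from the osm2pgsql polygon table.
SELECT b.name, l.osm_id, l.highway,
       b.way AS city_polygon,  -- included only as a visual cue; drop for speed
       l.way AS road_line
FROM planet_osm_line l
JOIN planet_osm_polygon b
    ON ST_Contains(b.way, l.way)
WHERE l.highway IS NOT NULL
    AND b.name = 'Golden'
    AND b.boundary = 'administrative';
```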
So the query has to return a lot of data in order to pull back both columns and duplicate that Golden geometry for every row. With the Golden geometry commented out, I can rerun the query, and on the spatial tab we now just see the roads it selected. We're still only getting 200 rows; we're not getting all of the roads in Golden, of which there are about 8,500, but this way I'm in the area I'm interested in working with. Continuing with this query, we have our spatial join established, and we're going to start doing some calculations. Drawing pretty pictures is nice, and it's a neat aspect of spatial data, but a lot of the time the analysis you want requires something more than a picture. In this case, we're going to use the ST_Length function. ST_Length lets us pass in our geometry, and it calculates the 2D Cartesian length of that geometry, returning a number. I can then take it a step further and wrap it inside a SUM, which lets us aggregate our spatial data after doing calculations on it. Now the query no longer returns spatial results: I'm returning a couple of grouped-by columns, the name of the polygon I filtered for and a classification of the highway type, with a calculated total length of each highway type within the Golden city limits. The problem is: what are my units? I have a bunch of really big numbers, and because I've sorted by this calculated way length descending, I know that service roads have the most total length in Golden, but what does that number mean? Our units are based on what projection our data is in, and in PostGIS we track this with the SRID, or Spatial Reference System Identifier. And we have, of course, a function that will tell us what projection our data is already in.
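The aggregation step described above could be sketched like this, building on the same join (the Golden polygon filter and column aliases remain assumptions):

```sql
-- Total length of each highway type within the Golden city limits.
-- ST_Length returns the 2D Cartesian length in the units of the SRID;
-- for osm2pgsql's default SRID 3857, that is meters.
SELECT b.name, l.highway,
       SUM(ST_Length(l.way)) AS way_length
FROM planet_osm_line l
JOIN planet_osm_polygon b
    ON ST_Contains(b.way, l.way)
WHERE l.highway IS NOT NULL
    AND b.name = 'Golden'
    AND b.boundary = 'administrative'
GROUP BY b.name, l.highway
ORDER BY way_length DESC;
```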
This lets us see that our projection is 3857. That is the default from osm2pgsql, so if you loaded your data that way and did not change it, this will be your projection too. Now that we know it's 3857, there is a table that PostGIS installs called spatial_ref_sys. This table lists all of the projections it knows about, along with the pertinent details for each. If I run this query and filter for the SRID our data is in, the information we need is in the proj4text column. The contents of that column are not all that pretty; it takes a little looking to find where the units are, but right here: +units=m. So we can tell that the units of 3857 are meters. Once we know what unit the data is in, we can proceed with our analysis and actually make meaningful results out of it. Before we go back to that, though, I'm going to expand on the units, because this is one of those tricky aspects that, when you're not familiar with it, makes it a little hard to find what you're looking for. And you might not just be looking up the SRID you're currently in; you may be searching for an appropriate SRID for the analysis you want to run. So I've created a helper view, and I have it in a Gist on GitHub as well; I'll post all of these links with the recording once the webinar is over. All this view really does is parse out that proj4text column and look for some known values. I've handled meters and feet; there's a distinction between feet and US feet, though I'm not sure there's really a difference between the two in practice. Then there are two other values, link and unset, that I didn't care to fully track down. And if the units are not set, the standard is to default to decimal degrees.
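The two lookups described here can be sketched as follows; ST_SRID and spatial_ref_sys are standard PostGIS, though the exact queries shown in the demo may have differed:

```sql
-- What SRID is the data in? (3857 is the osm2pgsql default.)
SELECT ST_SRID(way)
FROM planet_osm_line
LIMIT 1;

-- Look up the projection details; the units are buried in proj4text,
-- e.g. "+units=m" means lengths are in meters.
SELECT srid, proj4text
FROM spatial_ref_sys
WHERE srid = 3857;
```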
So I have this view that I call srid_units, and with it I can run some higher-level queries to get an idea of what units are common. The results show that the majority of the SRIDs in this table use meters for their units. There are also decimal degrees, which is what you think of as latitude and longitude, and then there are feet. And then of course there are those two at the bottom that I didn't handle; if someone knows what those are and wants to fill me in on the details, I would happily update my code to reflect it. But I didn't see them as important, and most of what we want to do will be in the three main units. We can now query this view for multiple SRIDs. 3857 is what we're working in; 4326, otherwise known as WGS84, is also very common and is in decimal degrees. One of the ways I use this view the most is when I'm working in a specific area. I won't go on too much more about projections, but you want to be working in a projection suited to your geographical region; otherwise your math is going to be wrong. Here in Colorado, using 3857 with units of meters is a bit of a misnomer, because we are quite a ways north of the equator, and the further north of the equator you go, the more distortion there is in calculations using those units. So you want to make sure you're using a projection suited for your area. I can run this query now using ILIKE. I use ILIKE because in Postgres, LIKE is case-sensitive and ILIKE is case-insensitive, and depending on what's in the text for these SRIDs, I just want to make sure I capture all of them. This lets me filter for projections that specifically mention Colorado but do not mention that they're deprecated. There are a handful of deprecated projections in here, and I see no purpose in using those.
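The Colorado filter can be run directly against spatial_ref_sys without the helper view; this is a sketch of the idea, and the srtext column is an assumption for where the region name and deprecation note appear:

```sql
-- Non-deprecated projections that mention Colorado.
-- ILIKE makes the match case-insensitive (plain LIKE is case-sensitive
-- in Postgres), so 'Colorado', 'COLORADO', etc. all match.
SELECT srid, srtext
FROM spatial_ref_sys
WHERE srtext ILIKE '%colorado%'
    AND srtext NOT ILIKE '%deprecated%';
```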
So I filter them out, and this shows that there are 36 different Colorado projections in the database I could use. This is just one of the tools you can use to look into your database and figure out what you should be using as you run your analysis. Back to our spatial querying. We now know that our data is in meters, and I'm going to ignore the nuances of projections and the fact that 3857 is not the best for our current area. I can now do some more math on this and make it a little more usable. I've added another column where I simply divide by 1,000 and do some rounding, which gives me my lengths in kilometers instead of meters. I've also updated the original column name to indicate that the base length is in meters, so it's a little clearer to a future reader who sees this code and has to figure out what I was trying to do. So now we can see that there are 406 kilometers of service roads and 355 kilometers of footways plus paths plus cycleways; we like our transportation to be on foot as much as possible here. Another helpful thing to look through in PostGIS is the full list of installed functions. I'm not showing this list through DBeaver's database browser and its functions node, because that way of looking at the functions is really tricky: it's too cluttered, there's too much going on, and I can never get the data out that I want to see. Querying it instead is a good way to see the argument data types, what kind of data each function returns, and, where descriptions have been added, what the functions actually do. We can scroll through: there are 257 functions installed by PostGIS 3 that take a geometry argument, and the functions we want will always start with ST_.
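One way to query the installed functions, instead of browsing the GUI tree, is directly against the system catalogs. This is a sketch of the approach described, not necessarily the query used in the demo:

```sql
-- PostGIS functions that take a geometry argument, with their
-- signatures, return types, and any attached descriptions.
-- The backslash escapes the underscore so it matches literally.
SELECT p.proname AS function_name,
       pg_get_function_arguments(p.oid) AS args,
       pg_get_function_result(p.oid) AS returns,
       obj_description(p.oid, 'pg_proc') AS description
FROM pg_proc p
WHERE p.proname LIKE 'st\_%'
    AND pg_get_function_arguments(p.oid) LIKE '%geometry%'
ORDER BY p.proname;
```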
I don't believe there are any major holdouts from older naming conventions; all the functions we want to use should start with ST_. This lets you look through and see the variety of functions you have at your fingertips once you've installed PostGIS. No amount of demos, walkthroughs, and webinars will ever cover the full depth of the functions available in PostGIS, so I encourage you to take some time to look through the list of functions in the database after you install it, get on the website, read through the docs, and see how they work. Sometimes what they do isn't quite straightforward, but the docs tend to be pretty reliable and helpful. Now, to go in a little different direction: we're going to look back at our highway points, but instead of filtering for just IS NOT NULL, meaning there's something in that column, we're now looking for a specific type of highway point. We're filtering for crossings: places where a footway or a cycleway crosses a motorized-vehicle street. And here is where we start seeing some details about the tags, the semi-structured key-value pairs in OpenStreetMap. Here we can use a simple not-equal-empty-string check to get an idea of what's there. If I flip to our spatial tab to look at some of the data, and zoom in, we can see that we have crossings crossing the road; this one's uncontrolled. So again this lets us see the data in a visual representation instead of just the tabular representation we use so frequently in database queries. A good step to add to your OpenStreetMap workflow is some quality assurance, just looking at what data is actually in there.
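A sketch of the crossings query; the highway='crossing' value is standard OSM tagging, but the exact tag key inspected and the not-equal-empty-string check are assumptions about what was shown:

```sql
-- Crossing points, with a peek at the raw crossing tag values.
-- Quality-assurance queries like this surface odd values such as
-- 'traffic_signals;marked', two values jammed into one tag.
SELECT osm_id, highway, tags->'crossing' AS crossing, way
FROM planet_osm_point
WHERE highway = 'crossing'
    AND tags->'crossing' <> '';
```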
There is a lot of data in OpenStreetMap. In the first session's high-level overview, I spent a lot of time showing some 3D maps of downtown Denver and explaining the sheer volume of data that goes into making a 3D representation of our world. All of that data is in OpenStreetMap, and there is a plethora of it to look through. Some of the problems with this are illustrated on this line here: 'traffic_signals;marked' is not an official value we should be using. It actually represents two values combined into one, with a semicolon as the separator. This isn't the way we do things in relational databases, and if you're not expecting this kind of data, you can be thrown off by it; or, more likely, you just miss out on it because you're not filtering for it. So it's really good to do some quality assurance and quality control, looking at these tag values to see what's in there. Again, this data is in the hstore data type, an extension you need to install, and you can dig through the data using skeys and svalues. There are a few other hstore functions that do similar things as well. Continuing with tags, now I'm setting the values aside and just looking at the keys. Here I'm working on the point table, filtering where amenity = 'cafe', and looking at all the different keys available for cafes in the Colorado OpenStreetMap data, with a count of how prevalent each one is. We can see that addr:street is the top one, not too surprising, and cuisine is also one of the most used. We can scroll down and see internet_access, and does it have a fee? Neat. There's just a whole lot of variety here in OpenStreetMap: we actually have 101 different keys associated with cafes in Colorado.
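The key-prevalence query can be sketched with hstore's skeys function; the subquery shape is an assumption about how the demo query was structured:

```sql
-- All hstore keys used on cafe points in the loaded data,
-- with a count of how often each key appears.
-- skeys() expands each row's hstore into one row per key.
SELECT key, COUNT(*) AS key_count
FROM (
    SELECT skeys(tags) AS key
    FROM planet_osm_point
    WHERE amenity = 'cafe'
) k
GROUP BY key
ORDER BY key_count DESC;
```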
So there's a lot of variety at your fingertips, but you need to know it's there before you can start taking advantage of it. Now we'll get into some more advanced filtering. Let's say I want a cafe, but I don't want a coffee shop. I don't know why; I like coffee, but this is the query I wrote. So here we are: we're still looking at the point table, we're looking for cafes, and now we are parsing out the data from the tags column so we can get to the cuisine data, with tags->'cuisine' not equal to 'coffee_shop'. But if we want to return cafes that don't have a cuisine set at all, we also have to have an OR in there: we have to handle the case where the key isn't present at all if we want to include those missing values. So if we want cafes that are not coffee shops, this is the kind of query we get, and we're stuck with an OR inside our filter. If we take a peek, we can look at where our cafes are. I queried this and we have 435, so I'm going to tell it to grab all of the cafes in Colorado and see where they are. We can zoom in anywhere on the map we want, and if we click on a value, we get the information from the database. This is a good example because it has a large amount of data in that hstore tags column: we have cuisine, their website, what city and state they're in, their street address, the hours they're open, and that they have some vegetarian options. This is all in the OpenStreetMap dataset; you just have to be able to parse it out and use it. I'll get rid of that and minimize here. Okay, on to a new spatial join. I'm going to start with a very basic query. Spatial joins are one of the building blocks of using PostGIS, but they're also one of the really hard concepts for folks to wrap their heads around. It took me a long time to really just let go and understand how they work.
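The cafes-but-not-coffee-shops filter described above can be sketched like this; the `?` operator is hstore's key-existence test, and the column list is an assumption:

```sql
-- Cafes that are not coffee shops. The OR branch keeps cafes that have
-- no cuisine tag at all: tags->'cuisine' is NULL for those rows, and a
-- plain <> comparison would silently drop them.
SELECT osm_id, name, tags->'cuisine' AS cuisine, way
FROM planet_osm_point
WHERE amenity = 'cafe'
    AND (tags->'cuisine' <> 'coffee_shop'
         OR NOT tags ? 'cuisine');
```

This NULL-handling pattern comes up constantly with OSM tags, since most keys are absent on most rows.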
So this is going to build from the ground up. We start with a simple query finding benches. I'm tired, I want to sit down, I want a bench. If I run it here, we can see that the first 200 rows of benches are in the southwestern portion of the state. If I pull all of them out, well, there are a lot of benches around, and I don't want to drive that far. So I'm going to add a spatial join. Instead of joining to Golden, I'm going to cast a slightly wider net and filter for a whole county: Jefferson County, where the Jefferson County polygon contains my bench. So I'm using ST_Contains again to filter down to just benches inside Jefferson County. On the spatial tab, we can see the polygon for Jefferson County, on the west side of Denver, up into the foothills, and there are a few red dots as possible benches to sit on. All right, so I've narrowed my ideal bench query down to benches somewhat close to where I'm at, so I can get to them in a reasonable amount of time. But now, it might be really sunny out, so I don't want to just sit on any bench and bake in the sun; I want to sit on a bench near trees. So I'm going to continue expanding my query. Our base table is the point table for the benches. We're joining to the polygons to get our county and to limit our benches to Jefferson County. Now I'm adding another join back to the point table, this time filtering for trees, so I'm getting the points that represent a tree. To find those, I'm going to use the ST_DWithin function. ST_DWithin lets us find one object that is within a set distance of another object or layer. In this case, we're looking for trees that are within 10 meters of our benches. And this is a good example of why you need to know what units your data is in.
In our case, we are working with 3857 and our units are meters. If this data were instead in 4326, the units would be decimal degrees, and this query would return extremely different results: 10 decimal degrees is a huge area compared to 10 meters. So knowing what units your data is in is extremely important for doing any sort of spatial analysis. With this, we can now find benches in Jefferson County that likely have some shade, because they're pretty close to some trees. I've added a count up here, counting the primary key from the trees layer, so I have an idea of how many trees are near each bench. Then I'm ordering by that count column and limiting to 15. So basically, I have just found Ryan's top 15 benches to sit on in Jefferson County. If I go to the spatial tab, I see just a very few points. I'm going to zoom in to this one in Golden, because I like Golden, and I can see by clicking on it that this particular bench has three trees within 10 meters of it. It also has a backrest, and that might be good; I did say I was tired. I'm going to switch our background layer. This is one of the neat things with DBeaver's spatial viewer: it defaults to the Mapbox tile server, but we can also switch it to OpenStreetMap and get a more standard OSM view. I zoom in here because OpenStreetMap will show me the trees that are mapped: I can see these little green circles around the trees. This bench happens to be right on Clear Creek, so you get a riverfront view, and there's a park across the river. So this might be a nice bench to sit on, and I found it using SQL. All right, back to our slides for one short moment. QGIS. QGIS is desktop software built specifically for GIS. For me, the main thing it does well is make really good-looking maps.
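The bench query built up over the last two paragraphs might look roughly like this; the county-polygon filter, the natural='tree' tag, and the aliases are all assumptions about the demo query:

```sql
-- Ryan's top 15 benches in Jefferson County, ranked by how many
-- trees sit within 10 meters (meters because the data is SRID 3857).
SELECT b.osm_id,
       COUNT(t.osm_id) AS tree_count,
       b.way
FROM planet_osm_point b
JOIN planet_osm_polygon c
    ON ST_Contains(c.way, b.way)       -- benches inside the county
JOIN planet_osm_point t
    ON ST_DWithin(b.way, t.way, 10)    -- trees within 10 m of the bench
WHERE b.amenity = 'bench'
    AND c.name = 'Jefferson County'
    AND t."natural" = 'tree'           -- "natural" is a reserved word
GROUP BY b.osm_id, b.way
ORDER BY tree_count DESC
LIMIT 15;
```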
The print composer is amazing: it allows you to create super high-quality, high-resolution output quite easily, with annotations and all of the goodies that make a map really speak to someone. You can also do all of the spatial analysis: anything you might want to do directly in PostGIS, QGIS can also do on its own. So you can do a lot of the heavy lifting up front in PostGIS, then bring the results into QGIS for final polishing, more analysis, and really exploring your data. One of the really good things about QGIS is that when you style your data, you can save your styles to the database, so your styling information is preserved right alongside the OpenStreetMap data itself. The downsides I have with QGIS start with working from scratch. If you're writing spatial joins like I was just showing you, those are really time-consuming and tricky to get right in QGIS; it's just not a database environment that encourages writing a lot of SQL. That's why I like using both tools together. And you just don't have the introspection into the database that you get with a more database-specific tool. So we're going to jump into a quick demo of QGIS. I have QGIS open to a blank project, and in the browser there's a PostGIS section. I have my servers already created and established, so I can expand the public schema, and there are my main tables: point, line, and polygon. I can bring in this data by dragging it, double-clicking it, or right-clicking and choosing add; there are a number of ways to add the data to the map. Once you add it to the map, it's considered a layer, and it will go through and query the database for every single point in that table. And I know this data is in Colorado; I've shown you from DBeaver that it is indeed in Colorado.
But a lot of times in QGIS, you're going to want a good base layer to start with. So I'm going to go to the Web menu at the top, choose QuickMapServices, and add an OSM base layer to our map. That gives us a whole lot of information and reference without having to add all of that data manually. Technically, this database has everything the base layer is using to draw that map, but it's going to render a whole lot faster when it doesn't have to load the raw data. Now, with this point data loaded, I don't really want all the points; I'd rather filter down. So I can right-click and choose Filter, and the query builder here gives me a list of all the fields if I want to build the filter that way. Or I could just type where highway is not null, to stay with the theme of looking for highways. (Actually, sorry, leave out the WHERE keyword in this case.) Now we have a more sparsely populated point map, filtered down to just the highway points. You can see in the Layers panel there's a little filter icon indicating that we have an active filter, and if I hover over it, it'll even tell me what my filter is. That's neat.

So that's one way to add data from PostGIS. I'm going to remove that layer and add it again another way. This time I'm going to go to Database at the top and open up DB Manager. DB Manager gives me access to my PostGIS databases. Oddly, they sort in a different order than the browser list does; I haven't quite figured that one out. With my database established and connected, I can click the icon in the top left to open a query writer tool. Now I can write my queries and add in spatial joins. A lot of times I'll have already drafted my query in DBeaver, and all I'm doing at this point is copying and pasting it over. In this case, I'll write the query here because it's simple enough: I'm going to pull from the OSM point table where highway is not null.
If I execute this query, it runs and returns the results. It tells us it returned 145,000 rows and took nearly three seconds to execute. Another thing to notice: look at all this null data across all of these columns. We're looking for highway data, and here we have aeroway and brand; I don't think our highways are branded, at least. Building, capital... there's all sorts of information in here we don't need. The advantage of going this route instead of just dragging your data in from the browser is that I can now limit the columns I'm returning. If I limit it down to the primary key, the attribute I'm actually looking at, and the geometry, and re-execute, it now takes about a second, roughly a third as long to return the same 145,000 rows. We're just getting rid of all the cruft we don't need. With the query working the way I want, I can now select Load as new layer at the bottom. It wants to know the primary key and the geometry column for the table so it can make everything work, and it also needs a name, so I'll call this highway point. Then, in the lower right-hand corner, there's a Load button I can click. Unfortunately, this dialog does not give any indicator of whether it succeeded or failed; I have to look at the Layers panel to see that yes, my highway point layer did indeed load. There's no visual cue that this actually loaded your layer; you have to go looking for it manually. It's one of those little things: if you're not aware of it, you can keep clicking Load all you want, and it'll just keep loading duplicate layers you then have to delete. So we have the same highway point data. The downside of loading data this way is that either way I come at it, I have to add filters, pare down the data, and work really hard to get the data into a usable format in QGIS.
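The column-limiting trick described above amounts to the difference between these two queries. The table name assumes the default osm2pgsql layout; the actual table may differ:

```sql
-- Dragging a layer in from the QGIS browser effectively runs SELECT *,
-- transferring all ~70 columns, most of them null for highway points:
SELECT * FROM planet_osm_point WHERE highway IS NOT NULL;

-- Writing the query yourself in DB Manager lets you return only the
-- primary key, the attribute of interest, and the geometry,
-- which transfers far less data for the same 145,000 rows:
SELECT osm_id, highway, way
  FROM planet_osm_point
 WHERE highway IS NOT NULL;
```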
And so that leads us to talking about data structure. If you've listened in on the first couple of sessions, we've already talked about this: OpenStreetMap data is not a relational data model. It uses semi-structured key-value pairs with very little quality control on the data entry side. There aren't constraints that force out bad values, and there's not just one way to code things; there are multiple ways. The end result is that we have these 70-column-wide tables containing all of the data, and the only grouping is based on whether an object is a point, line, or polygon. This is not the way things are done in relational databases. Tables really should represent one type of data; you should not group data together just because it's all points. That's like putting salaries and expenses together in one table because it's all money, right? It's salaries and expenses; it's all dollars. It's not appropriate for relational design. This makes things challenging when you're trying to use semi-structured data in a relational world.

Some of the challenges show up when you're trying to find objects. I showed you one example already that illustrated this: you sometimes have to come at things a couple of different ways. If you want building entrances, you have to look at the building column, and you also have to look at the entrance value in the tags column; it might be in both, or it might not, so you have to look in both places. Another example: fire hydrants. I really can't think of why a fire hydrant would sometimes be an amenity versus an emergency item. I think fire hydrants are pretty standard, yet we have two ways of tagging them. So we need OR statements between a regular column and an hstore key-value pair. Capital cities: this one is one of the messier ones that I know.
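The fire hydrant case mentioned a moment ago might look something like the following. This is a sketch, not the presenter's actual query; it assumes the default osm2pgsql layout where less-common keys land in an hstore `tags` column:

```sql
-- Fire hydrants can be tagged two different ways, so both places
-- have to be checked with an OR between a real column and an
-- hstore lookup:
SELECT osm_id, way
  FROM planet_osm_point
 WHERE amenity = 'fire_hydrant'                 -- one tagging style
    OR tags -> 'fire_hydrant' IS NOT NULL
    OR tags -> 'emergency' = 'fire_hydrant';    -- the other style
```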
If someone brought this query to me and asked why it isn't working, I would say: why do you have all these nested OR statements? That makes things tricky, and it's really hard to troubleshoot. We have four distinct places we need to look to get a capital city. First, it has to be place = city. Then we have is_capital, and we also have capital; both of those live in the tags column. And then we have admin_level as well, which we have to handle differently depending on whether we matched on is_capital or capital. This is the kind of querying where, if you have this nested throughout your analysis code, your troubleshooting gets much harder, and you're going to be a lot less confident that you got the data right. These types of structures make querying difficult.

Another thing I like to do is math on data, especially data I would assume to be numeric. But in the case of building height, you can't just trust that it's going to be numeric. The units are meters in OpenStreetMap; the wiki will tell you that, and it says not to put the "m" for meters in the data. Yet some people still do. So if you want to do an average, min, max kind of thing on data that should be numeric, you have to do some work to clean out the non-numeric values first. There's no straightforward average; you're going to get a typecasting error. And of course, there are non-sanctioned characters in column names. That colon is everywhere, and if you just type your column names with the colon in there like you would any other column name, you're going to get an error; Postgres doesn't like that colon. In order to get that data out, you must double-quote your column names. This is quite annoying, especially when working back and forth between different systems. Double-quoting all, or a good portion, of your column names is not a lot of fun.
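The building-height and colon problems just described can be illustrated like this. The table and key names are assumptions based on the default osm2pgsql layout, not the presenter's exact schema:

```sql
-- Naive version fails with a typecasting error the first time it
-- hits a value like '10 m':
--   SELECT AVG((tags -> 'height')::numeric) FROM planet_osm_polygon;

-- Defensive sketch: keep only values that look strictly numeric
-- before casting, then aggregate.
SELECT AVG(h.height_m) AS avg_height_m,
       MIN(h.height_m) AS min_height_m,
       MAX(h.height_m) AS max_height_m
  FROM (SELECT (tags -> 'height')::numeric AS height_m
          FROM planet_osm_polygon
         WHERE building IS NOT NULL
           AND tags -> 'height' ~ '^[0-9]+(\.[0-9]+)?$'
       ) h;

-- Columns whose names contain a colon must be double quoted:
SELECT "addr:housenumber"
  FROM planet_osm_point
 WHERE "addr:housenumber" IS NOT NULL;
```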
And in Colorado, you might be thinking, well, there are only a couple of columns like that. There are 1,100 distinct keys that contain a colon. These are mostly in the tags column, in that hstore column, but you still have to handle them as you work with the data. So if you're creating table and column names, it's really good to avoid special characters and use only the approved letters, numbers, and underscores. And if you go look at the SQL style guide I've linked here, you'll find that I've broken my own fair share of these rules, so this is the pot calling the kettle black. But the analysis I'm focusing on really benefits from a little extra TLC. The relational structure matters. Your data quality matters. And it matters to everyone: the humans writing the code and the humans troubleshooting the code. When something isn't right, when your end users are questioning your data and your results, if your code is tricky to read, it's going to be harder to find why something didn't work the way you expected. Giving things a better structure helps everyone along the tool chain who is consuming the data.

For this task, I have a project called PgOSM. I first wrote it about five years ago and open sourced it last year. The goal is to help with the restructuring and data cleanup, and to make it easier to use OpenStreetMap data for analysis purposes. This is the diagram I was using last time when talking about the process, and we left off here, at the osm2pgsql step. What PgOSM does is take the data from those three basic tables, point, line, and polygon, and put it into a more relational structure, making each table specific to one type of data. That's the data I prefer to work with for analysis and visualizations, because it's already parsed out into the components that I need. So, back to demos.
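The restructuring idea can be sketched in a few lines. To be clear, this is not the actual PgOSM code, just an illustration of the pattern it automates: pull one type of data out of the wide 70-column table into its own narrow table with cleaned-up column names and its own spatial index.

```sql
-- Illustrative only: one narrow, purpose-specific table per data type.
-- Assumes an "osm" schema exists and the default osm2pgsql source table.
CREATE TABLE osm.building_polygon AS
SELECT osm_id,
       name,
       building AS osm_type,
       "addr:housenumber" AS housenumber,   -- rename away the colon
       (tags -> 'height') AS height,        -- could also clean/cast here
       way AS geom
  FROM planet_osm_polygon
 WHERE building IS NOT NULL;

-- A spatial index on the small table, instead of trying to index
-- 70 columns on the big one:
CREATE INDEX gix_building_polygon
    ON osm.building_polygon USING GIST (geom);
```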
Back to DBeaver: I'm going to go to my next script, and we're going to look a little at the original polygon table and two of the restructured tables that I use. The main table has 821,000 rows of data; that's every polygon I'm using. If I split out the buildings, they account for 590,000 of those 821,000 rows. Add in land use, and 85% of all polygons in Colorado are either buildings or land use. Land use covers residential, industrial, and commercial zoning, and can include parks and playgrounds, things like that. So there's a wide variety of detail in both of these tables, but these are only two parts of the data, and they make up 85% of the rows. Looking at the rest of the tables, the next largest is the water polygon table, with 52,000 rows: lakes, ponds, and the like. Traffic polygons are largely parking lots. Leisure polygons are playgrounds, tennis courts, football fields, and so on. So if you're working with any data that is not a building and not land use, getting it out of the original table is not going to be as efficient as it could be, because there's no indexing strategy that covers 70 columns in an efficient way. Pulling these data points out into their own tables allows much easier, much more efficient access, both at the technology level and at the human-understanding level. Since I was looking at the number of columns in a table earlier: the largest tables I've created from this process have 17 columns, 17 versus 70, and a good portion of them have under 10 columns, each specifically associated with the data contained in that table. This is a much more limited, relational data structure, suitable for relational work. I also rename a bunch of columns: house number no longer has a colon in it, building height no longer has a colon in it, and height is actually a numeric value, too.
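One quick way to get the kind of per-table row counts discussed above, without scanning each table, is Postgres's statistics views. The `osm` schema name here is an assumption:

```sql
-- Estimated live row counts per table, largest first.
-- n_live_tup is an estimate maintained by autovacuum/ANALYZE,
-- so it is fast but approximate; COUNT(*) gives exact numbers.
SELECT relname, n_live_tup
  FROM pg_stat_user_tables
 WHERE schemaname = 'osm'
 ORDER BY n_live_tup DESC;
```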
We can see DBeaver gives some indication of what type of data each column holds, and we can see this one is numeric right there without looking much further. So if I want to do a min, average, or max on it, I can. Coming back to my "I want to sit down near a tree" example: this is exactly the same query as before, I've just added EXPLAIN ANALYZE to the beginning. I'm telling Postgres: run the query and tell me how hard it was. We're going to look at this, though I'm not going to go into much depth. I do want to point out that with Postgres 12, parallel query is working, so we can see we have three workers on this query. That's really cool to finally see. But what we're going to look at today is the total execution time. I'm going to run this query a few times just so you can see it's pretty consistent: 250, 275 milliseconds. Now, I rewrote this query to use the restructured tables. I have this POI point table here, which has our benches in it; I have a table that holds just our boundary polygons, like Jefferson County; and I have a natural point table that has my trees in it. As you write your joins, they become more explicit about what you're doing, instead of just joining a point table to a point table to a point table; these joins actually have meaning now. But the important part is that when I run this query, the execution time is in the 30 to 50 millisecond range instead of 250 to 300 milliseconds. Our queries are going to run much, much faster, especially when pulling out small sets of data that were painful to get at in the original data set. This is a much more efficient data structure for the type of querying I want to do. Jumping back to QGIS for one more set of demos over here: I'm going to get rid of what I had.
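The timing comparison above uses EXPLAIN ANALYZE, which can be prefixed onto any query. The table and column names below are placeholders:

```sql
-- Postgres actually executes the query and reports the plan with
-- real timings, including any parallel workers launched:
EXPLAIN (ANALYZE)
SELECT COUNT(*)
  FROM osm.natural_point
 WHERE osm_type = 'tree';
-- In the output, look for lines like "Workers Launched: 3" and the
-- final "Execution Time: ... ms" to compare runs.
```

Running it a few times, as in the demo, smooths out caching effects and shows whether the timing is consistent.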
I'm going to close up this public schema and shift over to this osm schema. This shows we have point, line, and polygon tables, with the first part of each table name giving an idea of what's in it. Before I load data (I'm zooming into Golden just so it loads a little less), I'm going to scroll down to this road line table and add in my road lines. One thing you should notice right away is that the data is styled. Once you have your data split out into sensible layers or tables, you can apply and save your styling for them, effectively styling your data as it comes in instead of manually styling every layer after you do the filtering. This is not feasible without splitting your data apart, because of the complexity of styles in QGIS. To get this look, we have rules, and then we have zoom-level rules. We have a rule that defines what a path is versus a motorway, and then we decide at each zoom level: do I want to see it, and how should it look? This is quite complex to do for even one layer, such as roads. I'm expanding all of these so you can see the volume of rules that go into making a visualization like this. If all of your data is shoved into one big table, there's no effective way to create these rules in a way that is maintainable by humans, at least not that I have found. So splitting the data out aids your styling as well as performance in the database. I could continue on with this. I can add, say, natural points; we've been talking about trees, so I can add in my natural point layer. We have a peak label here in the middle for South Table Mountain, but all these green dots are my trees. If I wanted to see other natural elements, such as lines, we now have cliffs, so we can add the lines for the cliffs. I'm going to take out that main background layer so you can see it better.
But by going through this way, we have the ability to style our layers and bring in exactly what I want. If I decide I don't actually want those trees anymore, I can take them out. If all of this is in one giant table, taking out just part of it becomes much less user-friendly on the analysis side. So this has been a very quick run-through of some of the tools that are available. I have barely scratched the surface of PostGIS functions; there are so many cool things you can do within PostGIS. These are just a few more of the functions I use on a regular basis that I thought I'd throw out there. Some of these we'll talk about in future webinars; others I will leave for you to go explore on your own. Thank you, everyone, for joining us again. I look forward to talking more about PostGIS and OpenStreetMap in a couple of weeks.