I want to welcome you to PostGIS Spatial Queries and Performance Tuning. This session is a deep dive into complex spatial queries and performance tuning with PostGIS, where we'll explore common performance headaches along with a variety of approaches to remedy them. GiST indexes, proper data modeling, and materialized views will be demonstrated as well. My name is Lindsay Hooper. I'm one of the conference organizers, and I'm here with Ryan Lambert, the owner of Rustproof Labs. Ryan has been working with GIS since 2011. He got started with PostGIS when, on a quest to update a road map, he began using PostGIS, Postgres, and OpenStreetMap. He has been a contributor to OpenStreetMap projects since 2015, and he has spoken on PostGIS and run it at both large and small scales. He's currently working on a book on how to use PostGIS and OpenStreetMap together. With that, I'm going to hand it off to Ryan Lambert. Ryan, take it away. Thank you, Lindsay, for that intro. Thank you also to PostgresConf for hosting this webinar series. And thank you, everyone who has joined us today for the live session. I'm really excited to be here again talking about PostGIS, this time adding performance into the mix. This is part of a six-part series; we are now on number four. The previous three recordings are already posted online. The link at the bottom here has the video from the intro, the first of the series, and also links to all the other recordings that are available, as well as the page on PostgresConf for the upcoming two sessions. So today we are going to focus on performance. We're starting with a little bit about hardware, how I've configured Postgres, and the data that I'm working with. This will set the stage for some expectations of what kind of performance we can get.
Then we're going to spend the bulk of today talking about queries and looking at the results and the output and what we can glean from EXPLAIN and ANALYZE and all of those related commands. And to finish off today, we're going to quickly dive into pgbench and a way that I approach benchmarking analytical and spatial queries using the pgbench tool. To start with our hardware: everything I've used for the testing I'm going over today ran on a DigitalOcean droplet. So yes, we are in the cloud. This is a general purpose machine, so it doesn't necessarily have dedicated CPUs or RAM or disk, which will add a little variance. But it is fairly reliable and consistent, and I've done a lot of other testing on this specific hardware, so I have a lot of comparison data to ensure that what I'm doing lines up with my historical testing that I've gone into in more depth. Another reason I'm using this particular size of hardware is that it's large enough, with enough CPUs and RAM, to start taking advantage of some of the really neat things in Postgres. But it's also small enough that I can still get performance headaches without having to load the full world file or something else. I'm not just throwing hardware at the solution. I really want to look at how I can make this particular hardware handle a particular size of data more efficiently. I have done this testing over a variety of days of the week and times of day, and really most of the timings are within about 5% of a baseline. They hover a little bit. Some days are faster, some days are slower. There are occasional times when I'll spool up a new instance, and when I'm running the initial data ingestion with osm2pgsql, it becomes immediately obvious at that point that the server I'm using right now is not as fast as I typically expect.
Friday late afternoon into the evening here in Mountain Time is not a good time to test DigitalOcean droplets in the New York City data center. At that time I've actually seen speeds 30 or 40% slower, because that data center apparently has a lot of load then. So it's something to be aware of: if you are doing this in the cloud, as I am showing, sometimes your results will vary based on the overall load that your cloud provider has. The specific versions I have used for my testing today are Postgres 12 and PostGIS 3; 12.1 and 3.0 are the exact minor versions. I'm mentioning this because I am using two specific features of Postgres 12 today. EXPLAIN has a new SETTINGS option that is extremely helpful, especially for me as a presenter trying to get a lot of examples together. I'm going back and forth between different configurations, different queries, and different setups. Having this SETTINGS option in the EXPLAIN helps me validate that I haven't forgotten to flip a switch on one side or the other, and that it actually ran with the configuration that I wanted. Also, parallel query now finally works with PostGIS data and queries. Parallel query has been around for a number of versions, and every major version since it was released has had improvements to it. The main improvement that I'm aware of in Postgres 12 is that it works for PostGIS, and I care about that a lot. So I have it enabled for most of our querying today. Getting into the configuration, I've only changed a few settings from the defaults for Postgres 12. First are shared_buffers and effective_cache_size, which I set to a fairly standard 25% and 75% of total system RAM. This might not be ideal for every situation, but it's a really good starting point for these two parameters.
I also increased checkpoint_timeout from its default of five minutes to 30, and checkpoint_completion_target from its default of 0.5 to 0.9. These really won't have much impact on the queries I was running on a test server with only me running my test queries, but on a real production server with other workloads, and specifically write queries, updates, and inserts, this can help spread out the overhead of those checkpoints over a longer time and give you a little more consistent performance. At the bottom, I have max_parallel_workers_per_gather. The default value in 12, as in 11, is 2, so it is enabled by default. I have adjusted this up to 3 because, again, I'm the only one running queries on this server, so I can take a little advantage of cranking that up and using another core of my CPU for these parallel queries. And the last thing I do here is set jit to off. As of Postgres 12, just-in-time (JIT) compilation defaults to on. In my case today it really wasn't a problem, but it also isn't a benefit that I have found for PostGIS queries yet. And when you have JIT enabled, it adds a bunch of output to the EXPLAIN ANALYZE, so I've just turned it off to help keep that output clear. And now the data. The data is the US West region that I downloaded from Geofabrik. This is the data that I loaded onto this server in number two of this series with osm2pgsql. And I have done the transformation using our PgOSM project, which restructures the data from osm2pgsql's original point, line, and polygon tables into a much larger number of tables that are more specific to the data they contain, such as roads or waterways. Instead of just putting all the lines in one table, we can split those data elements out into separate tables because they have different attribute columns describing them.
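Collected in one place, the handful of non-default settings described above would look something like this in postgresql.conf. The exact values assume the 32 GB droplet described here; treat this as a starting sketch, not a recommendation for your hardware:

```ini
# postgresql.conf -- non-default settings used for this testing
shared_buffers = 8GB                  # ~25% of the 32 GB system RAM
effective_cache_size = 24GB           # ~75% of system RAM
checkpoint_timeout = 30min            # default 5min
checkpoint_completion_target = 0.9    # default 0.5
max_parallel_workers_per_gather = 3   # default 2 in PG 11/12
jit = off                             # default on in PG 12; no PostGIS benefit observed yet
```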
So shoving all of that data into one table isn't really ideal from a design perspective. And a point on the data size here: our data itself is just over nine gigabytes, with another couple of gigabytes in indexes. So at a little over 11 gigabytes total on a server with 32 gigabytes of RAM, all of this data should fit into the system's RAM without a problem. So that is not going to be a bottleneck for us; we don't have a data size that is substantially larger than the RAM we have available, which should help our performance overall. On to using EXPLAIN. There are a number of options that go along with EXPLAIN beyond just the EXPLAIN itself. When you run just EXPLAIN, Postgres is going to look into the black box that is the query planner and tell you what it thinks it's going to do in order to run the query you've asked for. With just EXPLAIN, it will not execute your query; it's just deciding what it thinks it will do. When you add some of the other options in, they start executing and also allow us to look at more specific details and get more insight into what's happening inside the query planner. The docs on the Postgres site have a more complete explanation of all of this, as well as a few of the extra parameters that I haven't shown in this example. And it's worth reiterating that when you use ANALYZE with your EXPLAIN, it will run the query. This also means that if you're running a data-modifying query, it will modify your data. That's not always a problem if you're okay with changing it, but you do need to be aware that it will actually update your data if you run your EXPLAIN with ANALYZE included. One more little housekeeping bit is to VACUUM ANALYZE your data. This updates the statistics, makes sure there's not a bunch of extra bloat around, and really helps set you up for the query to run in its best possible form.
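As a quick sketch of those EXPLAIN variants (the table name is a stand-in, not one from this dataset):

```sql
-- Plan only: the query is NOT executed
EXPLAIN SELECT count(*) FROM some_table;

-- Executes the query; shows estimated AND actual times/rows.
-- Careful: run this on an UPDATE/DELETE and it really modifies data!
EXPLAIN (ANALYZE) SELECT count(*) FROM some_table;

-- Refresh statistics and clean up bloat so the planner has good information
VACUUM ANALYZE some_table;
```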
And that's one of the details when you're using EXPLAIN with Postgres: you're generally not looking at a worst-case scenario. In most cases, you're trying to say, well, if everything is going normally, what would the server do, and what is the server doing? And that's very different from the pgbench approach, as we'll talk about later on. So looking at performance, I am specifically looking at analytical workloads. As I've mentioned a couple of times throughout this series, I do a lot more work with analytical workloads than transactional ones, and analytical workloads have differences from transactional workloads. A couple of key points: you're generally going to have fewer users executing queries against the server, and those queries are generally going to be more expensive. Your analyst is probably in there looking for oddities, and oddities rarely fall within the standard use case of what you've defined and developed your database for. So these queries can be more expensive than what you would be running if you have a web front end with a transactional workload; that's a different type of modeling for your workload. So now we're going to get into looking at a query. If you were here for the last session where we looked at DBeaver and QGIS with a bunch of queries, this query might look a little familiar. In this query, my goal is to find comfortable benches that have trees nearby. I like to sit in the shade. And so I have a query here. It starts with a base table of miscellaneous amenity points, and in the WHERE clause I am filtering that first table down to a specific code. This is how I have benches coded in the database, so I'm filtering for benches. And then I'm going to do a join on our boundary polygon, and this uses the PostGIS ST_Contains function, which allows me to find the points inside of the polygons.
So this allows me to aggregate up to what are essentially counties. That's defined by admin level 6, and in US data that's generally going to correspond to counties; in other parts of the world admin level 6 has slightly different meanings, so be aware of that when you're working with this global data set. And last, we do one more join, to the natural point table, which contains our trees. I'm adding another spatial filter on this join using ST_DWithin, which allows me to find points and ways that are within 10 meters of each other. The 10 meters here is important: understand that this follows the projection of your data itself. If your data is in a projection that uses meters, as our data here is, those are the units it will use. If your data is in decimal degrees or feet, those are the units it will use. So you'll need to be aware of what projection your data is in when you're using queries that rely on the units of your spatial data. I went into more depth on that in the prior webinar session, including how to investigate your SRID and what projection units your data has. So when I run that query, I'm going to get back some text. This is just a small snippet of the text; we're going to look at it in a more user-friendly way here in a second. But it is worth getting used to this text, because it's the first thing you'll see, and it's a whole lot quicker to scan this, once you know where the pieces and details you're looking for are, than it is to copy and paste into an external tool for visualization. It's not really ideal at all for looking at in a webinar screen-sharing sort of environment, but luckily we have a tool that can help us with this: the explain.depesz.com site. I've been using this site for about as long as I've used Postgres. I'm pretty sure I discovered it within weeks of running into my first slow query.
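Reconstructed from the description, the benches-with-trees query likely has roughly this shape. The table names, tag values, and column names below are my assumptions based on typical osm2pgsql/PgOSM schemas, not the exact ones shown on screen:

```sql
-- Benches (points) inside counties (admin_level 6 polygons),
-- keeping only benches with a tree within 10 meters.
SELECT c.name AS county, COUNT(DISTINCT b.osm_id) AS shady_benches
    FROM osm.amenity_point b                  -- hypothetical bench table
    INNER JOIN osm.boundary_polygon c
        ON c.admin_level = 6
       AND ST_Contains(c.way, b.way)          -- point-in-polygon join
    INNER JOIN osm.natural_point t            -- hypothetical tree table
        ON t.osm_type = 'tree'
       AND ST_DWithin(b.way, t.way, 10)       -- 10 units of the data's SRID (meters here)
    WHERE b.osm_type = 'bench'
    GROUP BY c.name;
```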
And the first thing I'm going to do is switch to this source tab at the top. I've already pasted my query plan in, and it saved it for me. Now I can see the full details here about what I'm doing. One important piece I do want to point out: because I ran EXPLAIN ANALYZE, each section of the output is going to give me both predictions from the query planner and what actually happened when it ran the query. The first set of parentheses on each grouping line holds the predictions, and the grouping lines typically have arrows pointing at them; you'll figure out how to pick those out from all the lines of text. So we have our predictions here, and after the predictions we have another set of parentheses that starts with "actual". This is going to tell us our actual times, our actual row counts, and what it actually did when it ran this query. One of the helpful pages on this output is the stats page. I like this view a lot, and it's one of the main things I'm looking at when I'm pasting a plan in here, specifically the per-node-type stats table at the top. One thing I can see in this query is that most of our time, 94% of it, was spent scanning our indexes. Now, this doesn't mean that our data is properly indexed, but it does indicate that we're most likely not missing a critical index. So it's not necessarily a great sign, but it is a sign that at least you have indexes in place and the query is able to use them, especially when that's a majority of the total query. The HTML tab at the top here breaks out each of those individual steps into one row per step, with all the associated details. It breaks out some of the more interesting details and does some color coding to help you identify where to look; if you're looking for a problem, the colors are a decent place to start. So this is here and this is available.
I don't have enough time to really dive into the output of each and every one of these results, but every query I'm showing does have a link available at explain.depesz.com. Really all you need are the last few pieces of that URL, and you can put it in the URL bar of your browser. If you want to look at the full output as I'm going along, you're able to do that. Also, after the fact, when I get this recording posted, I'll ensure that I include all of these outputs as well, so you can review them at a later time. So I'm going to move on to a new query. If you looked closely at the first query and now look at this one, they might look very similar, and they are; I'm using two of the same three tables, after all. But I took out the benches, and now I'm just getting an aggregate of trees within counties. I have my same ST_Contains here for the join; that's our spatial join that allows us to get the trees that fall within a particular boundary or polygon. And it's just going to aggregate them together and count the trees for me. On the surface, it might seem like this query would run faster. It's simpler; it only has one join instead of two. But it's going to run slower. One of the key reasons why is that the benches in the first query are a very small set: there aren't that many benches. And I was also adding in the exclusionary filter to find only benches that have trees nearby, so that took a small number and made it even smaller. But now that I've taken out the benches and I'm just looking at trees versus counties, there are a whole lot of trees and none of them are being removed by the query; it's just counting them all. That is one element of the challenge we're dealing with, but it isn't the spatial element of what we're dealing with in this particular query.
The problem that we're experiencing is due to large geometries. Those polygons for those counties have incredible detail to them. If you zoom in to a street-level kind of view and look at a county boundary, you can see all the little curvatures of that boundary, which often are human-made lines that follow major roadways or go around other administrative boundaries. There are a lot of points involved in making all of those little zigs and zags in the data. And we have the ST_NPoints function, which counts how many points each geometry has for us. So this query is just going to take a look at how many points are in those county boundaries that we just used, and then order by that count descending so we can see the largest polygons first. The first few rows of the results show that we have a number of counties with tens of thousands of points. And each point is roughly 16 bytes of data, so just to bring it back to disk space and disk usage, we're talking about a decent amount of data for any one of these polygons. If I extended the results down, we'd see that there is a large number of these polygons with thousands of points. So this is one of the challenges with spatial data and performance: these large geometries. Luckily for us, PostGIS has given us a tool to help with this problem. ST_Subdivide has one required parameter, the geometry itself; it can't subdivide a geometry if you don't pass one in. And then there is an optional parameter for max vertices, with a default of 256. What this is going to do is take your very large polygon and split it up into multiple rows, and every row will be guaranteed to have no more than 256 vertices, or points. I personally have never modified this max vertices.
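The vertex-count check described above can be written like this (table and column names are illustrative):

```sql
-- How detailed is each county polygon?
SELECT osm_id, name, ST_NPoints(way) AS vertex_count
    FROM osm.boundary_polygon
    WHERE admin_level = 6
    ORDER BY vertex_count DESC
    LIMIT 10;
```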
I'm curious to know if there is a performance difference if you increase or decrease it, but I've always figured there's a good reason for 256, and I've just left it there. Even the documentation indicates that queries will probably run faster if you subdivide your data and then index it. That "and index it" is a key component, too: the data has to actually be stored somewhere. So I'm going to create a materialized view. This is another very handy tool in Postgres. Materialized views are very much like regular views; they're a stored query in the database. The difference is that a materialized view runs the query when you define it and then saves all that data to disk. And in the case of PostGIS functions, a lot of the operations we want to do are expensive. Subdivide itself isn't particularly expensive, but it isn't free either. By doing the subdivide in a materialized view, we can persist the split-up data to disk, and then we can add indexes to it. So here, the first CREATE INDEX is going to add what we would call the spatial index. It would help if I could draw lines here. The spatial index is what allows our spatial queries to run fast when we're joining a geometry with another geometry. I am also creating an index here on osm_id. That osm_id allows us to link back to the original source table and get back to that original geometry. Now I can rewrite my query. I am using my subdivided materialized view here, and I am joining to the points table to get out those trees. And this query is 97% faster; it runs well under a second. Cool. That's way better than 23 seconds. But if you were looking closely, you might have noticed that this query does not have the geometry data included. I just grabbed the osm_id and the count. It did the aggregation that I want, because I do want the data by county.
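The materialized view and its two indexes, as described, would look roughly like this (object names are my stand-ins):

```sql
-- Persist the subdivided county polygons to disk
CREATE MATERIALIZED VIEW osm.boundary_polygon_sub AS
SELECT osm_id, ST_Subdivide(way) AS way   -- default max_vertices = 256
    FROM osm.boundary_polygon
    WHERE admin_level = 6;

-- Spatial (GiST) index for geometry-to-geometry joins
CREATE INDEX gix_boundary_sub ON osm.boundary_polygon_sub USING GIST (way);
-- Regular index to link back to the original rows
CREATE INDEX ix_boundary_sub_osm_id ON osm.boundary_polygon_sub (osm_id);
```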
If I, as an analyst, send my target audience a dataset that has one row per subdivided county with the number of trees in it, that's not what they wanted. I need to get back to one row per county. Also, when I go create the pretty pictures in QGIS or other software, that subdivided data has all these extra borders; if you have borders on your display, that's going to show all those extra borders in your visualization. Because of this, I do need the ability to get back from the osm_id to, ultimately, the original way. To do so, one might be inclined to just add a join in. We have our subdivided polygon, and then I added a join here to the original boundary polygon table, joining on osm_id. I can assure you both sides of that join do have indexes. Then we're also adding our count of the trees. As a side point, I do have a LEFT JOIN in here because I want all the counties, even a county with no trees in it. This is honestly the query that I would write naturally if I went, oh yeah, I need to get back to that data. The problem is that it's not going to run very fast. It's actually going to run slower than the original query did. Bummer. One of the problems with Postgres and PostGIS joins specifically is that when you have multiple joins on spatial data, with other joins thrown in there, it doesn't get good row estimates a lot of the time. And when you have bad row estimates and complex queries, things go out of whack kind of quickly. One of the tricks we have in our toolbox is to rewrite this query. If I simply use a subquery here, then inside the FROM I now have the original query. This is what did the aggregation and the join with that subdivided table, but it doesn't have the geometry. Then I can join to that outside of the subquery, and this query, like the one without the way at all, will run in 0.6 seconds. It's going to run much faster. What we've done here is just help the query planner.
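A sketch of the subquery rewrite: aggregate against the subdivided polygons first, without any geometry, then join back to the original table for the full county shape (all names are hypothetical):

```sql
SELECT c.osm_id, c.name, agg.tree_count, c.way   -- full county geometry at the end
    FROM (
        -- Cheap part: count trees per subdivided county, no geometry output
        SELECT sub.osm_id, COUNT(t.osm_id) AS tree_count
            FROM osm.boundary_polygon_sub sub
            LEFT JOIN osm.natural_point t
                ON t.osm_type = 'tree'
               AND ST_Contains(sub.way, t.way)
            GROUP BY sub.osm_id        -- collapses the subdivided pieces back to one row per county
    ) agg
    INNER JOIN osm.boundary_polygon c
        ON c.osm_id = agg.osm_id;
```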
It realizes where the boundaries of certain join operations are, and in some way we're hinting to the planner: you should probably do this first and then do that. It's kind of like in math, algebra, and calculus: when you start getting into more complex formulas, you might not technically need the parentheses around all of the operations, but they do help communicate what order you think things should go in. It's a good way to indicate intent. Now we're looking at those query plans a little bit. The first one, without any subdivide at all, the one that I just ran as-is: we can see that 68% of its time was on this partial hash aggregate. What this is telling me is that it's spending a lot of time putting data back together to get the final counts. We then did our subdivide and our simple join, and this is the one that went even slower than the original. When we look at the stats here, we see that 61% of it is on a sort. Well, if you see a disproportionate amount of time on a sort, you can be fairly sure that inside your query you have a sort going to disk somewhere. This is bad from a performance perspective. You really don't want sorts spilling to disk; our disks are slow compared to RAM. In this case, we're getting about 125 megabytes written out to disk to sort this data. That's what I was talking about when the query planner wasn't sure what to do: it decided that it needed to do this instead of something that was actually faster. When you see a large number on your sort, it's a good time to go look and see if you are spilling to disk. There is a configuration parameter we can change for this, but I'm not going to change it on this query; we're going to get to that option here in a little bit. Now on to that subquery version, where it ran in 0.6 seconds again.
A good portion of the time was spent on an index scan, and it still had a noticeable percentage of time on that hash aggregate, but it's no longer two-thirds of the total execution time. So we have a faster query. Postgres understands what to do, and we can get our data out in a hurry. I personally am not a huge fan of writing subqueries in the FROM clause and inner joins and stuff like that. I find that logic harder to read six months or a year after the fact, when I have to go figure out what I was doing in this query and how I got that data. I personally like CTEs, these WITH queries. They start with the word WITH, and then you alias that subquery with a name. This is essentially the same query as the subquery version. I have just moved that portion up into the CTE, and at the end I join to that CTE, aliased as a, and go from there. And the query plan of the CTE version versus the subquery version is generally going to be similar or identical, at least in the most recent versions of Postgres. In older versions of Postgres, the CTEs might not get as good of a plan and you might have to use a subquery; that's why I showed the subquery first. But at least with Postgres 11 and 12, I haven't noticed much difference between the two methods. So we're going to pivot away from our points-and-polygons queries, and now we're going to look at waterways and roads. Here I'm specifically looking for waterways that happen to have a road within 15 meters. A couple of reasons why I might want to do this: one, I like hiking. I want to be able to hike somewhere, and I also like fishing, so is there a stream nearby? I don't want to go through the bush too much. Or another angle: maybe we're looking at runoff from roadways and possible contamination of our waterways. So there are a few reasons why you might want to run a query like this, specifically looking at distances like that.
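The same query expressed as a CTE, which I find easier to read later (again, object names are illustrative):

```sql
WITH a AS (
    -- Aggregation against the subdivided polygons, no geometry yet
    SELECT sub.osm_id, COUNT(t.osm_id) AS tree_count
        FROM osm.boundary_polygon_sub sub
        LEFT JOIN osm.natural_point t
            ON t.osm_type = 'tree'
           AND ST_Contains(sub.way, t.way)
        GROUP BY sub.osm_id
)
-- Join back for the name and full geometry, one row per county
SELECT c.osm_id, c.name, a.tree_count, c.way
    FROM a
    INNER JOIN osm.boundary_polygon c ON c.osm_id = a.osm_id;
```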
And we've already established that large geometries can and will cause us performance problems, especially when we're doing joins between large data sets. Going back to the original counts earlier: this waterways table has 1.7 million rows of data, and overall, if you look at the average, the average waterway only has 23 points. Not huge, and well under the default limit of the subdivide function. But we do have about 11,000 waterways here that are considered on the large side and could be simplified, which might or might not have helped our performance. I do want to point out one little bit of Postgres goodness here: the filtered aggregate, where I just drew this line. We have a normal count(*) that counts all the rows from the resulting data set, and in Postgres we also have the option to filter our aggregates using the FILTER syntax, which allows me to get a count of just the large geometries we're working with here. There are other ways to approach this type of query, but I really like the syntax and just wanted to highlight it about Postgres. Looking now at the roads, because this operation is going to have two base tables: we're looking at roads near waterways. Looking at the roads, we can see that we have 6.4 million rows, and about 22,000 of them are considered large and would be subdivided if we went down that path. The nice thing about this query is that all I have to do is change the FROM to a different table, and I can get these kinds of stats on any of the tables in my database. So I just keep this as a snippet that I can run to see if a table might be a good candidate for subdividing or other minimization strategies. So now, knowing that we have two large tables and both of them contain some large geometries, let's just join them together. What could go wrong, right?
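The reusable size-check snippet with the FILTER clause might look like this; swap the table in the FROM clause to profile any table (the 256 threshold matches ST_Subdivide's default):

```sql
SELECT COUNT(*) AS total_rows,
       ROUND(AVG(ST_NPoints(way))) AS avg_points,
       COUNT(*) FILTER (WHERE ST_NPoints(way) > 256) AS large_geoms
    FROM osm.waterway_line;   -- change this line to profile another table
```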
Well, it's probably going to be a bad thing, joining a big table to a bigger table with big geometries and trying to do it on the fly without that much hardware power. It takes 9 to 10 minutes. This is the only query I'm showing today where I actually put a range on the timing, because this query was really unpredictable in its timings. About 85% of the time I run it, I get 9 to 10 minutes. Every now and then it'll run faster, but predominantly it would run a little bit longer. This query is hard for the server to run; it takes a lot of time and it's got a lot of data. I don't like waiting 9 or 10 minutes for the results of my data, so I'm going to follow the same steps we used with the polygons. I'm going to create a materialized view for our roads. This time I'm going to add a step: I'm using a window function, ROW_NUMBER(), ordered by the original key and then the geometry that I'm creating. This gives me a unique number for every row that I can put a unique index on. The point of this is the assumption that I'm going to use this materialized view a number of times, in different places and different queries, and I'm likely going to eventually want to join back to it. Having a unique ID in that materialized view can help us get back to this data a little bit easier. Another step I did, which I didn't do in the initial query, is I'm starting to filter out some data. I'm going to pull out some of these highways; maybe I don't want the non-motorized roadways, so I can take those out. The end result is that even though I'm subdividing a bunch of those really large roads into multiple rows, we end up with only 5.85 million rows instead of the original 6.4-ish million that was in the base table. So we reduced our row count and we're reducing the average size of our geometries, both of which should help our queries run faster.
At this point, we want to create the unique index as I mentioned, as well as another GiST index on the new way column, our new geometry. You'll notice here I included the timings. The unique index takes three to four seconds to generate; the spatial index on this data took just over a minute. So creating these indexes is not free, but in the end it will pay off. Then the other side is the waterways, and I'm creating this one as a table instead of a materialized view because of the premise of my analysis: I'm looking for waterways with roads nearby. The reality of doing an analysis like this is that I started with 15 meters, but I might want 5 meters, I might want 30 meters, I might want to differentiate between roads that allow motorized vehicles nearby versus just foot traffic. So I might want a lot of different categorizations, which means I can do multiple operations in this step and create additional attribute columns that I can update later on. That was my reasoning for creating a table here instead of a materialized view. This step is also not just doing the subdivide; it is also adding in a buffer. Because I've decided I want 15 meters, instead of doing ST_DWithin on the fly I can generate the 15-meter buffer now and then do a simpler ST_Intersects later on in the final query. Again, this took about two and a half minutes to run to generate this table; it's doing a little more work to buffer the waterways before subdividing them. And I'm actually doing it twice, which is not the most performant way to generate a unique ID in a table, but it's pretty easy to do and typically isn't a huge hit on performance. Then I can add a primary key; I don't have to just use a unique index like I did on the materialized view. Here I can create a proper primary key, which makes the database designer in me feel warm and fuzzy inside.
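Pulling the description together, the two pre-processed layers could be built roughly like this. Everything here, the object names, the highway values filtered out, and the subquery-then-ROW_NUMBER pattern, is my reconstruction under those assumptions:

```sql
-- Roads: filter, subdivide, and number the rows for a unique index
CREATE MATERIALIZED VIEW osm.road_line_sub AS
SELECT ROW_NUMBER() OVER (ORDER BY osm_id) AS id,
       osm_id, highway, geom
    FROM (
        SELECT osm_id, highway, ST_Subdivide(way) AS geom
            FROM osm.road_line
            WHERE highway NOT IN ('footway', 'cycleway', 'path')  -- drop non-motorized
    ) r;
CREATE UNIQUE INDEX uix_road_sub ON osm.road_line_sub (id);        -- ~3-4 seconds
CREATE INDEX gix_road_sub ON osm.road_line_sub USING GIST (geom);  -- ~1 minute

-- Waterways: buffer by 15 m first, then subdivide; a real table so
-- extra attribute columns can be added and updated later
CREATE TABLE osm.waterway_buffer AS
SELECT ROW_NUMBER() OVER (ORDER BY osm_id) AS id, osm_id, geom
    FROM (
        SELECT osm_id, ST_Subdivide(ST_Buffer(way, 15)) AS geom
            FROM osm.waterway_line
    ) w;
ALTER TABLE osm.waterway_buffer ADD PRIMARY KEY (id);
CREATE INDEX gix_waterway_buffer ON osm.waterway_buffer USING GIST (geom);  -- ~20 seconds
```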
And the GiST index created in about 20 seconds, so this one was a much faster index creation, but it's also a few million rows less data. Now I can take the query I want and rewrite it based on our new layers. We have our waterway buffer table and our roads materialized view as my base tables instead of the raw tables. And instead of doing an ST_DWithin to say "find me roads within 15 meters of waterways," I can use the ST_Intersects function instead. In general, this will be a little bit easier and a little bit faster than trying to do that on the fly. Now with this query, I'm at 75% faster than running the whole thing as one raw query. We're going to dig in just a little bit more. I'm going to come back to spilling to disk and I/O timings. This query happens to spill to disk. The last one spilled 125 megabytes and this one only 13 megabytes, so it's a much smaller problem, which means the side effects will be much smaller, but it's also easier to fix. We do have an extra line on this output that gives us I/O timings; I'll talk a little more about that in a second. The parameter we can use in the configuration to adjust this and help remove this problem is work_mem. The default value is four megabytes, and it's been that way for a lot of years. The reason it hasn't been increased to a larger number is that this parameter changes how Postgres works per operation. This is not per user or per query, it's per operation, and one user can run one query that has many, many operations. The more complex your joins get and the more sorting you're doing, this can get really complex. So adjusting this can have much larger side effects than you might be estimating, and that's why I didn't try to tackle the 125 megabyte spill to disk; I'm tackling a much smaller spill. So here I've just shown the output of my psql commands: I set work_mem to 50 megabytes.
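The rewritten query, in sketch form, swaps ST_DWithin on the raw tables for ST_Intersects against the pre-buffered layer. Column and table names follow the hypothetical ones above.

```sql
-- The 15 m distance is already baked into waterway_buffer's geometry,
-- so a plain ST_Intersects replaces the on-the-fly ST_DWithin.
SELECT r.osm_id, r.highway
  FROM waterway_buffer w
  JOIN roads_sub r
    ON ST_Intersects(w.geom, r.geom);
```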
You can use SHOW work_mem to validate that the change took effect, or if you want to see what it is before you change it, you can run the same SHOW work_mem first. With this, the query runs about eight seconds faster. As we saw in those original timings, it was only about eight seconds of I/O timing, so that's about what I expected. But here is one of those places where the SETTINGS option in EXPLAIN is helpful: the settings output from the EXPLAIN ANALYZE will confirm that my work_mem did get changed to 50 megabytes, and that's very helpful. As for the I/O timings, you can get those if you run EXPLAIN with the BUFFERS option. You also have to have track_io_timing set to on in your configuration; the default is off. The total time we gained from changing work_mem was eight seconds, but if you do the math, you'll see we're closer to nine seconds of I/O time. Just because we got rid of the sort on disk doesn't mean we got rid of the sort. It still has to handle that data; it still has to sort it in memory. So you're not going to get 100% of those I/O times back as a gain in your performance. track_io_timing also comes with a warning: the docs warn about performance on certain platforms. So you don't want to just go turning it on because you heard some guy talking about it. You want to make sure it's not going to cause major negative side effects on your system and your setup. Just be aware that depending on your system, this could cause some pretty severe side effects. So if we look at what we've done on this query in total: we started with nine to ten minutes just joining two big tables together. By adding some pre-processing before we run the query, we got it down to about 30% faster as a whole. But the reality is you're probably not going to create those materialized layers, that table and materialized view, and use them only once.
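Together, the relevant psql steps look roughly like this. SET only changes the current session, track_io_timing generally requires superuser privileges to change per session, and the query itself is a placeholder standing in for the real one.

```sql
SHOW work_mem;             -- check the current value (default 4MB)
SET work_mem = '50MB';     -- session-level change only

SET track_io_timing = on;  -- needed for the I/O Timings line; check the
                           -- docs' platform warning before enabling globally

EXPLAIN (ANALYZE, BUFFERS, SETTINGS)  -- SETTINGS confirms the changed work_mem
SELECT count(*)
  FROM waterway_buffer w
  JOIN roads_sub r ON ST_Intersects(w.geom, r.geom);
```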
You're probably going to reuse them a number of times. So that 30% faster is assuming you're doing everything inline every time, and that's probably not going to be the reality. The pre-processing and cleanup steps really should be part of your overall ETL process, your cleanup and preparation that's predominantly automated. In database land, we know that filtering is good. Less data is easier to process than more data, so if you want things to run faster, you can generally filter down to less data and it'll probably run faster. To start with, continuing on with our materialized view and our table, I can add a new filter. I'm joining our waterway buffer table to the boundary table and filtering for Colorado. So here I can filter down to waterways that are within Colorado; well, waterways that have a 15 meter buffer within Colorado, to be technical. I have that in the first part of my CTE and I've aliased it as region_water. Down below I take that region_water and join it to the roads layer. The roads layer will automatically be filtered to the boundary of Colorado based on these waterways. Granted, there might be some roads outside of that, nowhere near waterways, that won't be included. That's fine; we don't want them anyway. And then to wrap it up, I can join our intersected data back to our waterway query. By filtering down to Colorado, this query runs in 96 seconds instead of roughly two minutes, so we cut off about half a minute. Not phenomenal. If you're really looking at filtering data as a way to greatly improve performance, it's better to do it earlier than later. So if I want to filter for the region of Colorado, I can use a query like this. I'm creating a table called my_area, and I'm doing this because I can join to this my_area table later in other queries.
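A simplified sketch of the CTE shape described here, with a boundary table and its name column as stand-ins for whatever your boundary layer actually looks like:

```sql
WITH region_water AS (
    -- waterway buffers that fall within the Colorado boundary
    SELECT w.id, w.geom
      FROM waterway_buffer w
      JOIN boundary b ON ST_Intersects(w.geom, b.geom)
     WHERE b.name = 'Colorado'
)
SELECT rw.id, r.osm_id, r.highway
  FROM region_water rw
  JOIN roads_sub r
    ON ST_Intersects(rw.geom, r.geom);  -- roads get filtered implicitly
```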
And if I want to change this filter from just Colorado to maybe Colorado and Wyoming, I can change it here in this one place, and that change will propagate through all of my subsequent queries that join to this table for their filtering. That reason, supporting multiple regions, is why I use ST_Envelope and ST_Collect. ST_Collect is a spatial aggregate. Just like you can take a sum of a bunch of numbers and get a total value, you can take this aggregate of a bunch of geometries and get one combined geometry. So if I select multiple boundaries, that gets me down to one big combined boundary. Then I use ST_Envelope because that gives me the bounding box. It's not going to give me the full polygon of the underlying geometries; it gives me the rectangular area that geometry lives within, and that will be a little bit faster later on. Now with this my_area table created, I'm going to recreate that roads materialized view. And here's another really neat bit of syntax that Postgres has that not all databases do: DROP ... IF EXISTS. This lets you try dropping a table, and if it doesn't exist, no big deal; it won't throw an error, roll back your transaction, and cause further queries to stop and not run. If it doesn't exist, it just keeps going. That's another little bit of syntax that's nice with Postgres. Here I'm creating my unique ID again using the ROW_NUMBER window function. The only difference is I've added my join to the my_area table to get down to those roads that are within the boundary I defined in my_area. And now this query runs in about eight seconds, which is quite a bit better than the roughly 45 seconds the original was running in.
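The my_area table described above might be sketched like this; again the boundary table and name column are assumptions standing in for your actual boundary layer.

```sql
DROP TABLE IF EXISTS my_area;  -- no error or rolled-back transaction if absent

CREATE TABLE my_area AS
SELECT ST_Envelope(            -- bounding box of the combined geometry
           ST_Collect(b.geom)  -- spatial aggregate: many boundaries into one
       ) AS geom
  FROM boundary b
 WHERE b.name IN ('Colorado'); -- add 'Wyoming' here and every downstream
                               -- query that joins to my_area follows along
```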
And again, like I did on the original, I'll create my unique index and GiST index. These didn't change, but I didn't want to leave them out; they did get recreated. Now I can go back to recreating the waterways table using my_area. This is the same query as before, just with the new line joining in my_area to limit the rows of data. And this again needs its primary key and its spatial GiST index. With this, if I now run my query again with our materialized layers filtered down to just Colorado, it runs in 12 seconds, compared to, I believe, 96 seconds last time, when we were filtering to Colorado on the fly. So this is faster. And if you filter to an even smaller region, such as a specific county in one state, this whole process of creating tables, materialized views, and indexes and running the query takes just over two seconds. So again, the more you filter, the more you're going to get down to a query time that is really, really fast. This chart shows the progression we just went through. At the beginning, I just joined the raw source tables together: nine minutes, and that's on the optimistic side. The second bar from the left shows the first iteration of cleanup, where I created that materialized view, filtered out some of the non-motorized roadways, created the waterways, and then did the join. By doing that pre-structuring and pre-cleaning, we cut off roughly three minutes of total execution time. Filtering down to Colorado, we got it under two minutes. And looking at Jefferson County, there is a fourth bar there; you might not be able to see it, it's barely sticking up, but that was just bare seconds compared to a few minutes. One thing about this timing chart: I could go back to any one of these queries and rewrite it better. I could make any of these four queries in the process, start to finish, better than what I've done.
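The rebuilt waterway table is the earlier creation query with one extra join; sketched with the same hypothetical names as before:

```sql
CREATE TABLE waterway_buffer AS
SELECT w.osm_id,
       ST_Subdivide(ST_Buffer(w.way, 15)) AS geom
  FROM planet_osm_line w
  JOIN my_area a
    ON ST_Intersects(w.way, a.geom)  -- the only new line: limit to my_area
 WHERE w.waterway IS NOT NULL;
```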
Getting ready for a session like this, at some point I had to stop making it better, and in hindsight I can see things I missed that I could have done. So just because this is what I got at each individual step, if one of those steps is the level of aggregation and pre-chewing that you need, you can continue working at that level and work on other optimizations to make that particular layer go faster. And I'm a big fan of starting small and scaling up. I do this whether I'm working with spatial data or just your generic plain-Jane relational data. If I'm working with data over time, I'll filter down to a small fraction of the total time range I'm interested in to work on my queries: make sure my logic works, work out the process, work out whether I even have all the data I need in the format I need it to be in. Starting small allows you to iterate really quickly, and then scale up. If the final process does take a dozen minutes or even hours to run, then it should be automated at that level. With small regions, one approach I take is to use a language such as Python to iterate through a list of regions. So I might be doing data by county, and I'm just looking at Jefferson County to start with. But once I get it working nicely for Jefferson County, I can take that filter off and let my Python loop run through county after county after county and run those queries over and over for me. I don't do that manually. So there are ways to scale up and scale out without manually waiting on some long process just because the full data set is large. Switching gears real quickly, I'm going to talk a little bit about visualization. I mentioned earlier how ST_Subdivide is not ideal for visualization, because you get all those extra polygon boundaries and borders that you have to figure out how to style out, or aggregate those geometries at the end, which is expensive.
Outside of ST_Subdivide, there is another function called ST_Simplify. ST_Simplify does modify your data; it really does simplify it. It takes all of those zigs and zags I was talking about on those county boundaries and finds which ones it can remove without drastically changing the shape. So it is changing your data. But the goal here is: I have the US West region loaded, I've done my nitty-gritty analysis in the data, I've written all my queries, I know what the data is, and now I need to make a pretty picture for the end user. With that pretty picture, I'm going to be looking at larger regions. I won't be able to see the zigs and zags that make up an individual county, so that detail is worthless at that zoom level. By using ST_Simplify to reduce how much data is in those polygons, data that's not going to be visualized anyway, you can speed your rendering in the final visual tool up by 80 to 95%. So if you're doing a lot of visualizations, just be aware that the tools you use for the analysis and the joins and that nitty-gritty-level stuff might not be the same tools you need for the visualization steps. Another example of where there are multiple ways to get a job done in PostGIS: my examples today have predominantly used ST_Contains to find this within that, or ST_Intersects for this near that. It might be faster to use the && operator instead of ST_Contains. It will give very similar results, but && does a bounding box comparison instead of the real containment check. In my case, a lot of times I do want the more precise, more exact definition. But if you don't, if a few extra rows in the output is not a problem and doesn't mess with your analysis, and performance is your key target, && might save you a little bit of time on some otherwise really expensive queries. In a nutshell, PostGIS has the tools you need. Moving on to pgbench.
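Both techniques in sketch form, with hypothetical table names. The ST_Simplify tolerance is in the geometry's units, and the && operator compares bounding boxes only, so it can return extra rows near the edges.

```sql
-- For display only: drop vertices that don't meaningfully change the shape.
SELECT county, ST_Simplify(geom, 100) AS geom_display
  FROM county_boundaries;

-- Exact containment: precise, but more expensive.
SELECT count(*)
  FROM county_boundaries c
  JOIN roads_sub r ON ST_Contains(c.geom, r.geom);

-- Bounding-box overlap only: cheaper, may over-count.
SELECT count(*)
  FROM county_boundaries c
  JOIN roads_sub r ON c.geom && r.geom;
```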
So pgbench, versus looking at one query with EXPLAIN ANALYZE, is a very different approach to performance. With EXPLAIN ANALYZE, you've taken one query and you're dissecting it, looking at what's going on behind the scenes and really trying to understand what it's doing. With pgbench, it's more a matter of "let's see how much I can throw at you and see if you crash." How much load can your server handle? It's a tricky challenge. I will profess to be no expert at benchmarking, but I make my best attempts and I share what I do and make it available so others can compare what I've done to what they're doing. I think that's one of the best ways to approach it: be open and transparent with what you've done and keep trying. And with analysis workloads, just throwing as many queries at the server as it can handle really doesn't tell you all that much, at least not much that's helpful to me. What's more helpful is when you start adding in rate limiting and a quality-of-results target in the form of a latency limit. By rate limiting, we're saying we're not going to run more than X transactions per second against the server. And the latency limit says: when those transactions are running, my end user does not want to wait more than this amount of time. This allows us to hone in on what might be a more realistic load for our workload and our servers, while working in some quality aspects as well. I don't like waiting for my data to come back. So to start, I put a query that's representative of the workload I'm trying to model and understand into a SQL file. Ideally you're not going to use just one query; you'll have an array of queries that capture the overall gist of what you're working with and the types of queries your server handles.
For this example, though, I'm just using the one file, and it's the query using Jefferson County data that returned in 0.8 seconds; that's where we were at with this query. At the top, in the top box, we have the pgbench command I used, and at the bottom we have the last portion of the output from pgbench. The first part of the pgbench command, -c 10 -j 4, says I want to use 10 clients with four threads. -T is 600, which is 10 minutes, and honestly that's not long enough to get really meaningful results from pgbench, especially where I have checkpoint_timeout set at 30 minutes; this run is not guaranteed to cross a checkpoint. So that's a problem, but for this testing it's sufficient. I have my rate limiting here as -R 3. I've decided that my 10 fictitious analysts are probably not going to be running more than three queries per second. As someone who does analytical work, there's a lot of time spent thinking, looking at the data, and trying to understand the problem; you're not just running query after query as fast as you can. There's a lot of waiting and stops in between. So a limit of three transactions per second across my 10 hypothetical clients is not that bad. Now, the latency limit is in milliseconds, so what I've done here is set it to 1500, one and a half seconds, as my quality target. I really don't want my analysts waiting more than one and a half seconds for this query to return. That's almost double the best-case scenario we tested with EXPLAIN ANALYZE earlier, so it gives me a lot of leeway in performance before it really becomes a problem, but it does set a threshold on where I expect that quality to be. The last part is to specify the file or files; you can use the -f flag more than once. And then, of course, the database you're going to run it against. The output down below confirms it doesn't run exactly three transactions per second.
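Assembled from the flags described above, the command was along these lines. The script and database names are placeholders, and -n (skip pgbench's default vacuum of its own tables, which don't exist when running custom scripts) is an assumption on my part.

```shell
# 10 clients across 4 threads, run 600 seconds, rate-limit to ~3 TPS,
# flag any transaction slower than 1500 ms, use a custom SQL script.
pgbench -n -c 10 -j 4 -T 600 -R 3 -L 1500 \
    -f analysis_query.sql \
    my_database
```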
It tries to keep it close; in this case we got 3.1. I'm going to move on to some aggregates. I took some of the key details out of that output and put them up here to digest a little. In 10 minutes we were able to process 1,862 transactions, with a very small handful of them, 2.2%, skipped. And 4.5% of those queries were late; they did not run within the 1.5 seconds I specified. We can also see that our latency average, the time it takes to run the query and get it back to us, was 0.9 seconds. So our query execution time went up by 0.1 seconds; the server did take a hit in its ability to keep up with the workload. And I worked on this pgbench command to get it to that threshold. My goal was to get it to the point where it starts to break down. I want to start seeing these late queries and skipped queries, and I want my latency average to go up, because that's how I know I'm really pushing the server hard. I don't need to break it completely, but I do want to push it hard. So now, with our server running quite hard to get our results, I'm going to disable parallel query. I set it to zero to turn off that feature. When I ran pgbench again, I now get 1,872 transactions. Technically that is more than before, but only by 10, so it really only got one query per minute faster. If you're just looking at that TPS number, that's really not the whole story; it's just part of the story. The other part of the story is that none of these transactions were skipped, and only a small fraction of them were late: 0.5% instead of a good 4.5%. That's a meaningful result. And our latency average came back down to 0.8 seconds, where it was when we ran the EXPLAIN ANALYZE under ideal conditions. From this, we can see that under load, parallel query is likely to cause contention on the CPU.
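Disabling parallel query for the second run comes down to one setting; assuming, as the talk describes, it's the per-query worker count being zeroed out, that looks like:

```sql
-- Session-level: the planner stops generating parallel plans.
SET max_parallel_workers_per_gather = 0;

-- Server-wide alternative; takes effect on reload:
-- ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
-- SELECT pg_reload_conf();
```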
By having parallel query on when the server gets really loaded, all of those queries trying to use parallel workers are now overlapping more than they would if parallel query were disabled. And to be fair, I did crank the setting up to use three workers instead of the default of two. What we would likely see if we did more significant testing between two and three workers is that two is a little more stable than three as the load increases, even though three workers will likely be faster under ideal conditions. One more configuration I would look at changing is the maintenance_work_mem setting. The default here is 64 megabytes. Unlike the work_mem we talked about earlier, this isn't per operation; it doesn't multiply out the way work_mem changes can. This speeds up your vacuums and your CREATE INDEX runs, and you generally aren't running a huge number of those types of operations at once, so you can probably increase it. I often set this to one gigabyte on a 32 GB RAM server. I haven't really done performance checks to see how much it helps those operations, but it's likely helping. So if you want your spatial queries to go fast, I'm going to tell you basically the same thing I'd tell you if you want your relational queries to run fast: put your data into a proper structure. Make sure you have primary keys, foreign keys, check constraints, anything that helps make sure your data is good and clean and structured right first. And all of that ETL-style pre-processing should be automated and done early on, so you can run those queries faster later on. One really cool thing with Postgres is streaming replication. I love using streaming replication to distribute my read workloads. When I'm working in QGIS making pictures, that's largely just reading data out, so I never connect to our primary read-write server.
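The maintenance_work_mem change described above, as a sketch:

```sql
SHOW maintenance_work_mem;   -- default is 64MB

-- Used by VACUUM, CREATE INDEX, and similar maintenance operations;
-- unlike work_mem it isn't multiplied across every query operation.
ALTER SYSTEM SET maintenance_work_mem = '1GB';
SELECT pg_reload_conf();
```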
I always connect to a read-only server and put that load over there. And then, of course, if you're still having slow performance, you can always scale up your hardware. That is always an option. There are plenty of places to ask for help when you get into this and aren't sure where to go. The mailing lists are a great spot; there's a performance-specific mailing list you can get in touch with. There's a Slack with a PostGIS channel specifically that you can ask questions in. And there are a few more links to resources in the documentation and the wiki about tuning and using EXPLAIN. All right, Nico said that they use pre-cached subdivides as well, and from their testing they found that 60 is a good number of vertices to use in ST_Subdivide. There's speculation that it might be specific to certain types of shapes or geometries or relationships between them, and that is absolutely true. It often comes down to your data, your hardware, and your queries. Just because some general guidance out there kind of works most of the time doesn't mean it'll always work for everyone. But that number of 60 is a great number to start with, I think. I have another question that just came in. Sorry. Have I leveraged table partitions, or would that be equivalent to creating many materialized views? Table partitions, generally speaking, don't help with query-time performance; they're more going to help on maintenance. If you have a lot of time-series spatial data coming through and you need to delete the older data over time, and you've got a large amount of it, partitioning might help your performance on the maintenance side of things, getting rid of your stale data. But from an analytical end-user querying perspective, it's likely not going to help on the performance side.
Okay, now I'm not seeing any more questions coming through. So thank you again. I hope everyone has a fantastic rest of your Tuesday and I look forward to seeing you in a couple of weeks.