Welcome to Loading PostGIS with osm2pgsql. This session is going to dive into how to install, configure, and use osm2pgsql on Linux to load OpenStreetMap data into PostGIS. My name is Lindsay Hooper and I'm one of the Postgres Conference organizers. I'm here with Ryan Lambert, the owner of RustProof Labs. Ryan started working with GIS in 2011 when, on a quest to update a road map, he started using PostGIS, Postgres, and OpenStreetMap. He has been a contributor to the OpenStreetMap project since 2015, he has spoken in New York on PostGIS running at small and large scales, and he's currently working on a book on how to use PostGIS and OpenStreetMap together. With that, I'm going to hand it off to Ryan. Take it away.

Thank you, Lindsay, and thank you to everyone who has joined us today for the live version. I am Ryan Lambert and I'm really excited to be here today talking with you about getting OpenStreetMap data loaded into PostGIS. Today's session is the second of a series. The first one was a high-level overview of the technology and the data and how it all fits together; that recording is posted already, and all of the links from the slide deck are available at the URL above. Today is all about loading data into PostGIS, and the remaining four sessions after today cover how to use and work with the data once it's loaded. So I hope you will continue joining me for this series as we work our way through this dataset.

The demo is going to go over three different ways to run osm2pgsql, with different switches and different commands. Then we're going to go over the process while those demos are running. The process itself is a little bit complicated, so I'll dive into what it's doing behind the scenes, and then we'll talk about how to prepare for this data load and how to run the process itself. If we have time I'll show a few example queries, but that is hit or miss depending on how the timing of everything else goes, to keep this within the window we're looking for.

Our goal for today is really to get our OpenStreetMap data loaded in. My focus when I run this process is analysis. I am not an expert in running this process for a worldwide carto tile server; I'm loading this data in and using it for specific purposes other than a tile server, so some of the nuances of what I'm explaining may not apply to all of those other scenarios.

Getting started with the demo, I have three servers I'm going to run this on. The first is a Raspberry Pi 4, which is going to load the Colorado dataset. Then I have two much more significant pieces of hardware, each with eight CPUs and 32 GB of RAM; those are both going to load the US West region in two different ways, to show the difference in the options we can use and the effect they have on the run.

First I'm going to flip over to this terminal where I'm logged into my Raspberry Pi 4. I printed some information just to verify what we're working with hardware-wise: we have four CPUs, it's the 4 GB RAM model, and there's a small swap file enabled. I've also shown a few configuration values from Postgres itself. I've set shared_buffers down to 300 MB, I've disabled the parallel query functions for this little hardware, and I've increased checkpoint_timeout to its maximum value of one day.
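As a rough sketch, those conservative Pi settings could look like this in postgresql.conf. The shared_buffers and checkpoint_timeout values are the ones stated above, while using max_parallel_workers_per_gather to disable parallel query is my assumption about which knob was turned:

```
# postgresql.conf -- conservative settings for a Raspberry Pi 4 (4 GB RAM)
shared_buffers = 300MB                # far below the usual 25%-of-RAM guidance
max_parallel_workers_per_gather = 0   # disable parallel query on tiny hardware
checkpoint_timeout = 1d               # maximum value; checkpoint only when forced
```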
Now, if you are not familiar with these details and nuances, don't worry; we will discuss them as we go through the session. I have a command that I'm just going to copy and paste; it's a series of three commands that kicks off this process, and the actual command that kicked it off is listed up here. I've pulled up the output so it shows as it runs. One of the neat things with osm2pgsql is that while it runs, we get live output of its progress: it tells us what it's working on, how much data has been processed, and how fast it's processing that data. This output becomes very helpful on larger datasets and longer-running loads for estimating completion time.

With that one running, we're going to flip over to our second server. This is a server with eight CPUs and 32 GB of RAM; it's a DigitalOcean droplet with very fast SSD drives. Here I've set shared_buffers to 8 GB, which is 25% of the system total. I've left parallel workers enabled at the default of two, so Postgres will try to use parallel query if it thinks it's justified, and checkpoint_timeout is again set to the maximum value. I'm going to copy over the next command for this server and paste it in. It starts processing right away, and down at the bottom you see the processing line with the details. If you made a mental note of how fast the first one was running, you'll notice that on a much more powerful server the rate of processing is much, much higher. I'm not going to talk too much yet about the nuances of which switches I'm using or not using in this case.

Now we'll switch over to the third server. This is also eight cores and 32 GB of RAM, and the Postgres settings are identical to the previous one I showed you. The only difference is that we're going to use a slightly different osm2pgsql command. I just pasted all of that in, so the command itself is all the way up here already. The difference is that we've added the --drop switch and the --flat-nodes option. I'm not going to talk about those differences until a little bit later in the slides.

So with those demos running, we're just going to let them run in the background and continue with the overview of the process. This is a very heavy process to get this data loaded. At a high level, we're taking that PBF data source and pushing all of the data over into PostGIS. This is an ETL process; ETL stands for extract, transform, and load. In the way I actually run it, it's really an EL-TL process: we do a load before we do the final transformations and loads. It's a very complex data transformation process. Overall, it looks a little bit like this diagram, which comes from the osm2pgsql wiki; it represents how the process works and where the data fits in around the process. The starting point is that PBF file. If you listened in on the first session, we talked about how the PBF format is extremely compressed, much more compressed than what gzip can provide. All of that data has to be uncompressed before it can be loaded into Postgres, so that is part of the cost equation for this processing: taking the data out of that extremely compressed format to get it loaded.
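The exact demo commands aren't readable here; as a hedged sketch, a basic --create --slim run along the lines of what the Raspberry Pi is doing might look like this, where the database name, cache size, and file path are illustrative assumptions:

```bash
# Hypothetical sketch of a basic osm2pgsql load; the exact demo values differ.
# --cache is the node cache size in MB of RAM -- keep it modest on a Pi.
osm2pgsql --create --slim \
    --cache 1500 \
    --number-processes 4 \
    --hstore --multi-geometry \
    -d pgosm \
    ~/osm/colorado-latest.osm.pbf
```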
What's shown on this diagram is inside of PostGIS: the PBF data gets pulled through osm2pgsql and pushed out into these database tables. Each of those six objects represents a table in PostGIS, but it's important to note that the size of an object on this diagram does not represent the relative sizes of the tables. In reality, the nodes table on the lower left side is the largest of all of them. The smaller shapes are what are called intermediate tables: osm2pgsql loads a bunch of data into the nodes, ways, and rels tables in order to build the planet_osm_point, planet_osm_line, and planet_osm_polygon tables. When I mentioned the --drop switch on that third server, what it does is remove those intermediate tables once the processing is complete. If you don't need the intermediate tables, you can wipe them out of the database and free up a whole lot of storage space on your disk. In all reality, the end goal is these three tables: we want the point, line, and polygon tables that are generated. This is where we build the rest of our analysis; this is where we pull out the spatial data and all of the goodies from OpenStreetMap.

I take the approach of using a short-lived server for this process. I don't do this on our production Postgres instances because of the amount of overhead required: it takes an enormous amount of I/O and an enormous amount of RAM. And the Postgres configuration that will get it to run and load the fastest is definitely not the ideal configuration for my production instances. That is one of the core reasons I started doing it this way on a temporary instance: to keep that high-overhead process off of my production servers. It also turns out to be very cost effective; with short-term cloud instances you can spin up quite a lot of power for a short time and get the process completed fairly cheaply.

This chart looks at the total cost to process a region with osm2pgsql, using all of the tooling: spinning up the server, configuring it, installing the software, and then running the actual data load itself. On the US West region, in the left grouping, we have three bars: green is Rig B, yellow is Rig C, and orange is Rig D. The yellow bar in the middle is the same level of hardware I'm using in those two demos, with eight CPUs and 32 GB of RAM. We can see that I can load the US West region for a few pennies. If we scale up to the middle grouping, North America takes about 50 cents of expenses to process, and if I want to process all of Europe it's going to cost a couple of bucks. Considering the volume of data we're processing, the dollar amount to get the data chewed up and loaded into PostGIS is very minimal.

With my temporary servers, I do this process on a single node. Some folks might split it out, running osm2pgsql on one server and having it load into Postgres and PostGIS on another. I have found it's much simpler to use a single instance and configure everything to play nicely together. If you split your process out into two servers, you also have to take network latency and bandwidth into account, and I don't have any good data on what kind of network speed you need to actually process faster on two instances; it's been quite a while since I've tested that. So I just use one instance; it keeps things simple and it gets the job done.
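A quick way to confirm what a --drop run leaves behind is to list the planet_osm tables from inside psql; this is a sketch, assuming the default public schema:

```sql
-- After a --drop run, only the rendering tables (point, line, polygon, roads)
-- should remain; without --drop you'd also see the intermediate
-- planet_osm_nodes, planet_osm_ways, and planet_osm_rels tables here.
SELECT tablename
    FROM pg_tables
    WHERE tablename LIKE 'planet_osm%'
    ORDER BY tablename;
```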
osm2pgsql pulls in the PBF file on the top left and loads data into those three tables we looked at earlier: the point, line, and polygon tables. In my processing I have another step that runs after this to handle further transformations. That's the PgOSM project that I've started; what it does is convert the semi-structured dataset from OpenStreetMap into a much more structured dataset designed for relational database usage. The semi-structured key-value pairs just aren't sustainable for large-scale analysis, and if you like relational databases and are familiar with normalized data structures, those semi-structured datasets can be quite tricky to work with. So I do that extra set of transformations, though I won't talk much more about that today. Then I use pg_dump to get the data out, because remember, I'm using a temporary instance: this thing isn't going to live beyond the processing, so I have to get the data out to load into other servers. Once I have it out, I can take it and put it into dev servers, QA servers, and prod servers as needed.

Getting ready for this process is a bit heavy-handed; there's a lot that goes into it. There are a few details you should plan out at a high level first. How big of an area do you need? I use Colorado a lot, so a lot of times I can get away with downloading just the Colorado sub-region from Geofabrik. Sometimes I need a few states at a time, so I might be able to use US West, or I might need to load all of North America. Getting an idea of how much data you want to load is important because it's going to affect how much hardware you need to get it loaded in a reasonable amount of time. You might need to download and process multiple regions, so you want to take into account whether you have to run this process multiple times over different areas. And how often do you need to update your datasets: do you need daily updates, or is weekly or monthly enough?

The size on disk is an important consideration here. Going back to the PBF format: it is extremely compressed, so when we take the data out of that compressed format, put it into PostGIS, and add spatial indexes and all of those goodies, our data sizes are going to grow quite a lot. I have found five to ten times to be a reasonable ballpark for how much your data will grow. It will depend on the exact region of data you're loading, but five to ten times the size, when using --drop to get rid of those intermediate tables, is a decent ballpark. You need to consider your total disk space, too, and this goes back to Postgres configuration as well. Down there I have listed Postgres logs. My production servers have a whole lot of logging enabled so I can see what kind of activity is going on on those servers; I decided that's an acceptable load for production. But in this type of data processing, enabling all of that logging in Postgres is going to consume an extra amount of space that you might not be planning for.
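The talk doesn't show the specific logging parameters; as a sketch under that assumption, quieting a throwaway load-only instance might look like this:

```
# postgresql.conf -- keep logging minimal on a temporary load-only instance
log_min_duration_statement = -1   # don't log statements based on duration
log_statement = 'none'            # don't log individual statements
log_checkpoints = off             # skip checkpoint chatter during the bulk load
```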
So you just have to plan for all of these different elements of the process that take up space on your disk, and ensure that you have enough space to handle it. If you're doing local virtual machines and provisioning your hard drive space to try to match what you think you need, this can get kind of tricky.

The flat nodes file, if you are using flat nodes, is something you need to be aware of: it's a big file, and it continues to grow as the OSM data continues to grow. The file size is derived from the maximum node ID, I believe, so as new nodes are added, that maximum value keeps going up and the file keeps getting larger. Over the past 12 months I found that this file increased by about 17%. So it's big, and it's going to get bigger as time goes on; it's just something you have to plan for if you're going to take advantage of the flat nodes option.

The load frequency is up to you; it depends on how much data you need and how often you need it refreshed. The Geofabrik download server that I use provides data updates once a day, so at most, using this approach, you can do daily updates. I'm typically in the range of weekly to monthly. I don't need to update our data every day, but a lot of times I know changes have happened in OpenStreetMap that will affect the analysis I want to run, so I'll go ahead and pull a new file every week or every couple of weeks. Depending on how often you want to load the data, you'll want to adjust how much you automate and how much you document.

I love automating things. This is a long-running, heavy process, and I believe automation is a very good thing here. And when you're getting started with this, start with a small area. These processes take a while to run, and failures an hour into the process are quite frustrating. If you start with a small area, you can get your automation and/or your documentation worked out to be pretty accurate, and then you can scale up from there. With the Postgres configuration being non-standard, for me anyway, automating that configuration is very important. You don't want to be making configuration mistakes on a long-running process, because you really won't know it's messed up until it has taken twice as long to run as you expected. Automating here reduces the chance of mistakes, and it will make your life a little bit happier if you run this process a lot. I use Ansible as my core automation tool; my playbooks handle all of the configuration and installation steps. You can use CI tools; I've also used Jenkins, and Docker allows you to automate a lot of this in various ways. So take your pick of tools and figure out a way to streamline it.

Since I recommend you start small, I'll recommend an area to use: I like the Washington DC area. There's a lot of really neat topology to start with, a lot of cool landmarks, and it's just a neat layout of a city, but it's also really small. It's 16 MB in the PBF format, and with all the intermediate tables and everything, it's still under 300 MB in PostGIS. It's a very fast load even on relatively small hardware, which allows you to test and smooth out your process without a whole lot of time-consuming waits.
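For reference, grabbing that small DC extract follows Geofabrik's standard naming; a sketch:

```bash
# Download the small Washington DC extract (~16 MB PBF) -- a good starter region.
wget https://download.geofabrik.de/north-america/us/district-of-columbia-latest.osm.pbf
```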
Once you get your process working at a small scale, I recommend taking it up a little bit at a time. If your final goal is Europe, that's fine, but you don't want to jump from loading Washington DC, which is 16 MB, straight to loading Europe, which is orders of magnitude larger. There's a whole lot of nuance in how this process runs on your hardware at file sizes between the smallest and the largest. So take your size up incrementally and make sure you have your bugs ironed out before you try to run this on Europe or a full planet. And when things don't go right, document what didn't work as well as what did. This has been very helpful to me: some of the problems I have encountered with this process are intermittent and few and far between. They don't happen all the time, and if you don't document what happened, two months later you may not remember all the nuances of what caused the problem. So this helps out a lot, at least for me.

Installing Postgres and PostGIS is fairly straightforward. I won't talk very much about this other than that you do need to install the database and the GIS extension. I'm using Postgres 12 with PostGIS pretty much across the board at this point, so I'll be using those for the demos. osm2pgsql, on the other hand, is a little less straightforward. There's been a lot of activity on this project in the past few months, and that activity has produced really good results. The latest and greatest version of osm2pgsql is 1.2, and I very much recommend that you use that version, or a newer one as time goes on. If you're installing the default version packaged with your OS, you're probably getting a quite out-of-date version; the 0.96 and 0.94 branches have some shortfalls that have been fixed in later versions. So I definitely recommend you use the latest version. Unfortunately, to do so you're probably going to have to build it from source. That's not always the happiest thing to tell folks, but it's fairly straightforward. These instructions came from the osm2pgsql wiki: they install all of the prerequisites you need in order to build it, and then you simply clone the repo, build, and install, and you have the latest and greatest version. This process typically takes about 10 to 12 minutes to run, and it should be automated. I had to look through my Ansible playbooks in order to do presentations like this, because I honestly had no idea how this was installed anymore. It had been quite a while since I wrote that playbook, and I never do this manually, so this is a good thing to forget how to do.
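As a sketch of that build-from-source process on Debian or Ubuntu, along the lines of the project's 1.x instructions; the prerequisite package list varies by OS release and osm2pgsql version, so check the project's README:

```bash
# Install build prerequisites (Debian/Ubuntu; the list may vary by release).
sudo apt install make cmake g++ libboost-dev libboost-system-dev \
    libboost-filesystem-dev libexpat1-dev zlib1g-dev libbz2-dev \
    libpq-dev libproj-dev lua5.2 liblua5.2-dev

# Clone, build, and install the latest osm2pgsql from source.
git clone https://github.com/openstreetmap/osm2pgsql.git
cd osm2pgsql
mkdir build && cd build
cmake ..
make
sudo make install
```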
All right. So at this point we have Postgres, PostGIS, and our osm2pgsql tool installed, so now we need to get Postgres ready for the process. Our goal here is to get Postgres out of the way; reducing I/O as much as possible is really the main goal. Once you've done that, there are other goals you can focus on, but the most critical thing to be concerned with is getting rid of as much I/O as you can from Postgres.

One of the important changes there is wal_level. Starting with Postgres 10, the default value of wal_level changed from minimal to replica. For most production servers that is the right default, but in the case of trying to reduce WAL, the write-ahead log, and knowing that this is a temporary instance, the write-ahead log doesn't have much benefit for us. If the server crashes, I'm going to start the whole process over from scratch anyway; there is no recovery in this process. So I set wal_level back down to minimal to get it out of the way as much as possible. checkpoint_timeout, which I pointed out earlier, is set to the maximum of one day. This just tells Postgres: don't bother writing checkpoints based on time; only write them if you actually have to. That gets some unnecessary background activity out of the way for this processing. For shared_buffers, I typically use 25% of the system total, which is fairly standard Postgres wisdom. On larger hardware that is fine; on the Raspberry Pi or other similarly resource-challenged hardware, it's going to be overly aggressive and you'll probably end up with some nasty errors later in the process. So be a little bit more conservative with that if you're using less powerful hardware.

Some extra considerations to take into account with Postgres 12: JIT, just-in-time compilation, has been found to cause some performance issues when you're loading updated data. I want to stress that this is not an issue with JIT itself in Postgres; it's really an issue with the row estimates behind the scenes. It has been a long-known reality that those estimates were bad, but that didn't have any real negative side effects until some of these more advanced features came along. What happens when you're running the update process with osm2pgsql is that it runs queries behind the scenes, the estimator says "this query is going to return this many rows," and based on the number of rows it thinks it's going to get back, it decides that using JIT is going to be helpful. When those estimates are wrong and inaccurate, the decision-making process is flawed. I personally have not had a performance hit by leaving JIT enabled, so you will need to test on your systems and your data process specifically to see if it affects you and your process. There is work on both sides, both in osm2pgsql and in Postgres, to fix this problem and basically make it go away.

Another consideration with Postgres 12 is that parallel query finally works for PostGIS data. Parallel query has been in Postgres since version 9.6, quite a few versions now, and it's gotten better every version, but up until Postgres 12 it really didn't do any good for PostGIS queries. That has changed: Postgres 12 now allows PostGIS queries to take advantage of parallel query, and in the case of my data loads it runs quite a bit faster; 15 minutes versus over 20 minutes is a big deal in my opinion. Also related to the JIT issues, there are users who have had performance hits from leaving parallel query enabled. The Pi is a good example of why: it has such slow I/O that enabling parallel query really doesn't get you much of anything, because it's already waiting on the I/O layer. As far as parallel query goes, just under half of the available cores is where I've found the sweet spot: with a server with eight CPUs, two or three workers are fairly close; three is slightly faster, but not by a large amount. Again, you'll probably need to test with your data and your process to see if it's beneficial to your workloads or not.
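Putting those changes together, a hedged sketch of the load-instance tuning might look like the following; note that wal_level = minimal also requires max_wal_senders = 0, the shared_buffers value assumes the 32 GB demo server, and the JIT line is something to test rather than a prescription:

```sql
-- Throwaway load-instance tuning (wal_level and shared_buffers need a restart).
ALTER SYSTEM SET wal_level = 'minimal';
ALTER SYSTEM SET max_wal_senders = 0;        -- required when wal_level is minimal
ALTER SYSTEM SET checkpoint_timeout = '1d';  -- checkpoint only when forced
ALTER SYSTEM SET shared_buffers = '8GB';     -- ~25% of a 32 GB server
ALTER SYSTEM SET max_parallel_workers_per_gather = 3;  -- just under half of 8 cores
-- Worth testing both ways on Postgres 12 if updates run slowly:
-- ALTER SYSTEM SET jit = off;
```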
All right, automation and documentation are good things. One of the best benefits I've received from writing my own blog is that I write my procedures publicly, and that means I can Google my own procedures to find what I need; it's very helpful to me. The content I write on our blog about this type of process is my documentation. These are my procedures, and they really closely reflect the way I run this process on a regular basis. I recommend that anywhere my process deviates from yours, you document what's different, until you've done it so much that you have muscle memory. After eight years of running this process, I still find myself looking up my own tricks and tips, going "wait, where did I say that? Oh yeah, that's right." There's just so much to remember.

So Postgres is configured; we can now go ahead and get our data source. This wget command will get the US West region. You can see that it's a relatively large file; the US West region is currently at 1.8 GB in that compressed format. A very important step when you download this data is to verify the checksum. Geofabrik provides the MD5 checksum at the same file path and file name with .md5 appended; download that and verify that what you downloaded actually is a good file. When you do not do this step is when you'll find that you got a corrupted file, and sometimes it errors out real fast, sometimes it doesn't; sometimes it's obvious that you got a corrupt file, sometimes it's not. So verify this checksum, and it'll save you a whole lot of pain and troubleshooting later on.

Then we create a database to hold our data for a little while. I like to use the pgosm database name; I've used that name for this process for years and years at this point. For extensions, you obviously need postgis for the GIS portion. The hstore extension is required if you want to use the hstore data, the key-value-pair semi-structured data from OpenStreetMap, and most likely you're going to want the hstore data.
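Pulled together, those download, verification, and database-creation steps look roughly like this sketch; the Geofabrik paths follow its standard naming, and pgosm is the database name from the talk:

```bash
# Download the US West extract and its MD5 checksum from Geofabrik.
wget https://download.geofabrik.de/north-america/us-west-latest.osm.pbf
wget https://download.geofabrik.de/north-america/us-west-latest.osm.pbf.md5

# Verify the download before spending an hour loading a corrupt file.
md5sum --check us-west-latest.osm.pbf.md5

# Create the database and the two extensions the load needs.
createdb pgosm
psql -d pgosm -c "CREATE EXTENSION postgis; CREATE EXTENSION hstore;"
```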
So we've configured Postgres, we have a database, and we've installed the extensions; now we can run the actual osm2pgsql process. When you start running this process, if you are brand new to it, please, for your own sake, start on the conservative side with your settings. Keep your cache lower, set your number of processes down if you want, but don't try to be too aggressive on your very first run; it takes a while to find the right balance on your hardware. Remember especially that osm2pgsql is sharing resources with Postgres and the operating system. That detail is more important on smaller servers; larger servers seem to handle the sharing a little bit better, since they have a little more overhead and wiggle room, but on smaller hardware you really want to err on the conservative side. This is the command that I ran on that third server I started; it uses pretty much all of the features I regularly use in this process. The best place for learning more about all of these switches and what they do is in the docs folder of the project, in the usage.md file; that's the most accurate and up-to-date resource on all the switches and options. So I'm going to look at each of these switches in a little bit more detail.

--hstore, as we've talked about, is what allows the semi-structured key-value pairs; this option is the basic one, and there are a few other switches you can add on here. If you're going to be querying the hstore column a lot, you can have a GIN index added to it. You can also push all of the attributes into the hstore data field: even if a tag might have its own dedicated column, you can force all of the data to go into hstore if you like.

--multi-geometry: I've never run this process without it, but that's because I use this for an analysis type of purpose. If you're running a worldwide carto tile server, you probably don't want multi-geometry; I believe those work better without it. What this feature does is keep complex shapes in one place, instead of splitting them out into multiple rows, and from a relational database perspective we want those related objects all in one spot. So you probably want this if you're doing any type of analysis on the data.

--slim and --drop are two different options, but they are very closely related, and the fastest way to load data is using them together. What --slim does is create those intermediate tables we looked at earlier: the nodes, ways, and rels tables. If you have really large hardware and lots and lots of RAM, you can get away without using --slim, because osm2pgsql can keep all that information in memory; most of us probably don't have that much RAM available, and we need to use --slim. Also, if you want to run updates on the data, you have to use --slim. --drop is a really neat feature because it gets rid of those extra intermediate tables after the fact. Considering I never update the data, I always load completely from scratch, this is the way I always do it. The benefit, other than having a smaller on-disk size at the end, is that it uses the unlogged tables feature in Postgres. Folks who have run this process with the pre-1.0 versions of osm2pgsql may be wondering why I'm not using the --unlogged flag: that's because it's gone in 1.0. In 1.0 and later, that option is toggled automatically when you use --drop. That was the only time the feature really helped or was appropriate to use, so they just enable it for us automatically; the option is completely gone and handled behind the scenes.

The unlogged tables feature has been around in Postgres for a while, and it's a really cool feature for very fast data ingestion. It bypasses the write-ahead log, the WAL, in Postgres, and that has a side effect. The benefit is that it reduces how much I/O you have on your disk, and that's a win for this process. The other side of that coin is that it makes these tables not crash-safe; that's where the write-ahead log comes in, it's what keeps a table crash-safe, and without it the table is no longer crash-safe. Again, it's a temporary process on a temporary server and it's intermediate data, so I really don't care. All I care about is that I can get the data to load 12% faster. That is my goal.
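The mechanics osm2pgsql toggles automatically there can be seen in plain SQL; a minimal sketch with a made-up table name:

```sql
-- Unlogged tables skip the write-ahead log: much faster bulk writes,
-- but the table is emptied after a crash (not crash-safe).
CREATE UNLOGGED TABLE load_scratch (id bigint, tags text);

-- An existing table can be flipped either way (each flip rewrites the table):
ALTER TABLE load_scratch SET LOGGED;
ALTER TABLE load_scratch SET UNLOGGED;
```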
The flat nodes option, which we've now mentioned briefly a couple of times, allows bypassing Postgres for some of the data ingestion: instead of using that intermediate nodes table, it shoves all of that data into a flat binary file on the hard drive. You really want a fast disk for that file, because it's a really large file, 54 GB, and this really comes into play when you have larger imports. If your PBF file is under one gigabyte, don't even bother with this: it will not help you; it will run slower. If you have really fast SSDs and your PBF file is at least one to two gigabytes or larger, this might help; you'll want to test for sure. Once you exceed about 15 GB for your source file, this is pretty much the de facto standard. As I mentioned, this is a 54 GB file, and it doesn't matter how large or small the source file is: Washington DC is 16 MB of PBF and less than 300 MB with the intermediate tables, but it's still going to write a 54 GB file. And that's why I say you don't want to use flat nodes on source files under one gigabyte; the trade-off just isn't there.

The trade-off is against writing data into the planet_osm_nodes table, that intermediate table. When you load Colorado and you let it load into the nodes table, you're going to get 26 million rows of data. When you load US West, you're going to get 228 million rows of data; that's 14 GB on disk that it has to write. So the race here is: can the server write a 54 GB flat file faster than it can load 228.7 million rows into Postgres, with all of the relational database constraints and overhead that entails? It depends; your hardware is going to be the main determining factor in whether this is beneficial or not. I have found it to be beneficial with files as small as that US West file we're loading, and that's why I ran one demo command without flat nodes and one with flat nodes.

On to the cache. For the osm2pgsql cache setting, some general guidance is about 75% of your total RAM if you have a good hefty server. The smaller your hardware, and the less RAM you have available relative to the size of your data, the more conservative you'll want to be and pull that back: on smaller servers you may want to put it as low as 50%, and on a Raspberry Pi I go even further down, to around 35-40%. If you over-allocate your cache here, you can cause yourself a bad_alloc later. Sometimes, if you go way over, if you're on a server with 4 GB of RAM and you tell it to use 50 GB, it's going to go "ha ha, no." But on that cusp where maybe it'll work and maybe it won't, it'll depend on how much RAM Postgres needs and the OS needs. If you're on that edge margin, it might run for a while, but then you'll get a bad_alloc. Be aware that going too aggressive on these settings can cause headaches and errors later on.

This is the same command; I just brought it back up to refresh what we've been talking about. I have a slide here with output, but that was only to take over if my live demos didn't work. Luckily for us, our live demos have at least started completing. This is the third one I started, so this is Rig C number two, and at the bottom here I've brought up the command that I ran to cause the processing we see above. We used --slim and --drop, we have the cache set to 75% of the total, 24 GB, and we're using hstore, multi-geometry, and flat nodes. The process took just over 1,000 seconds, which is roughly 17 minutes by quick math.
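Piecing together the switches just described, a hedged reconstruction of that third server's command might look like this; the flat nodes path and file names are assumptions, and --cache 24000 reflects the stated 75% of 32 GB, expressed in MB:

```bash
# Hedged reconstruction of the third server's run (8 CPUs / 32 GB RAM);
# the flat nodes path and file layout here are illustrative.
osm2pgsql --create --slim --drop \
    --cache 24000 \
    --number-processes 8 \
    --hstore --multi-geometry \
    --flat-nodes /data/flat_nodes.bin \
    -d pgosm us-west-latest.osm.pbf
```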
So this server was able to load US West in roughly 15 to 20 minutes; that's a good ballpark. If I compare now to the other one that we ran, I'll pull up the command here: this one used --slim but not --drop, and nowhere in this command is there --flat-nodes. We did not use flat nodes, so it put the nodes data into Postgres in that intermediate nodes table. And this one took 1,800 seconds, so a little over 30 minutes. We're looking at 15 to 20 minutes versus about 30 minutes: those two options, --drop and flat nodes, allowed this data load to run much, much faster.

So that's in there, and real quick we're going to check how our Pi is doing. This is the Pi, and we can see it hasn't actually finished yet; it's still working on creating indexes. It finished creating the indexes on the points, so it still has to build indexes on the line and polygon layers. Just an illustration that smaller hardware absolutely takes a longer time.

Back to the other two, we're going to look at one more set of numbers, from that processing line. This is the line that interactively updates as the data is loading. We can see here that without flat nodes and without --drop, we were loading just under 500,000 nodes per second on this server. If we flip over to the third server, which did use flat nodes, we're now processing nearly 2 million nodes per second, much, much faster, because the data size is large enough that the flat nodes file can be written much faster than Postgres can handle the same task. The offset is that the ways processed at about 56k per second; without flat nodes, the ways process faster, because in Postgres they're able to take advantage of the indexes. So there are a lot of trade-offs going on behind the scenes with these options, where the cut point is, whether it's faster or slower to use a given option. And again, it's going to come down largely to how big the data is that you're loading and how powerful your hardware is.

So coming back over here, I'm going to flip through these slides; we don't need those. Looking inside Postgres, this shows what you can see if you log in with psql. I can see that the size of that pgosm database is right about 10 GB, much larger than the 1.8 GB source file we started with; it has a whole lot of extra goodies to help PostGIS do what it needs. Next, this is looking at the tables: rows and size on disk of the tables that were loaded. This is what's left after --drop runs. You'll notice the roads table at the bottom, which I haven't really talked about; I personally don't use it, so I don't have much to say about it other than that it is there. We can see the polygon table here: if you expand that from scientific notation, it turns out to be 10 million rows of polygon data, along with 8.6 million rows of line data and four and a half million rows of point data. So we've got a good amount of data loaded into our database here for us to play with.
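The exact queries behind those numbers aren't shown; a sketch that surfaces the same information from psql might be:

```sql
-- Overall database size (the ~10 GB figure for pgosm).
SELECT pg_size_pretty(pg_database_size('pgosm'));

-- Row estimates and on-disk sizes for the tables left after --drop.
SELECT relname,
       n_live_tup AS approx_rows,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_stat_user_tables
    WHERE relname LIKE 'planet_osm%'
    ORDER BY pg_total_relation_size(relid) DESC;
```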
Coming back to our big-picture diagram: we've loaded data through here, and normally what I would do is another transformation, but for the sake of time today we're going to skip straight to the step of getting the data out of the system. If I want to get the data out, one easy way is just pg_dump into a flat, text-based SQL file. This is relatively fast and reliable. Notice it takes two minutes to get the data back out, versus 15-plus minutes to load. The downside with this format is the size: our raw SQL file is going to be quite large compared to the original source file, so depending on what you want to do with this data and your network bandwidth and all of that, this could likely be a problem on its own. Another option around that is to use the directory format of pg_dump. This allows us to use parallel processing to dump out the data, and we can also add compression at the same time. While this one results in a much smaller file size, we're now at 4 GB for this version, the time to produce this output is about five minutes; it takes longer because of the added compression. It's all a matter of trade-offs, your infrastructure, and what you need to do with this data. Getting data back in, you can also use parallel processing to load the data while decompressing it, and with this we can get the data loaded in just a few minutes. That's about a third of the time, or less, that it takes to process with osm2pgsql. The time is one reason why I do it this way, because it is faster; I can load this into a production server fairly quickly. But it also has a whole lot less overhead than osm2pgsql does: even though it's going to take me a few minutes to restore this dump into a production database, I don't have that massive amount of RAM consumed, and I don't need anywhere near as much I/O. It still wants a lot of I/O, the more the better, but it isn't as critical as it is for osm2pgsql processing.
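A hedged sketch of those export and import commands; the file names, job counts, and target database are illustrative:

```bash
# Plain-text dump: fast and simple, but produces a large .sql file.
pg_dump -d pgosm -f pgosm.sql

# Directory format: parallel dump (-j) with compression, much smaller output.
pg_dump -d pgosm -Fd -j 4 -f pgosm_dump/

# Parallel restore into another server's (pre-created) database.
pg_restore -j 4 -d pgosm_prod pgosm_dump/
```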
I like starting small, because when you get errors, they happen late, and I really don't like errors that happen late. This is an example I caught of getting a bad_alloc. I had actually run this process with cache equal to 200, so I thought I was pretty safe, but it turned out to not be happy, and I got this nasty bad_alloc. To go into the timing of it, I pulled out the details of the processing speed, how many nodes it had processed and how long, and the same with ways and relations. I took all of that and calculated that the process had been running for just over an hour before I hit that error message. That stinks. What do you do to get around that? Well, one, use the latest versions: 1.2 is more stable than 1.0, which is more stable than the pre-1.0 versions, so if you can use the latest version, I highly recommend you do. Two, be more conservative with shared_buffers and the cache: cut those back and make things run a little slower as the trade-off for stability. But really, the best way to get rid of this kind of error, and to speed up your processing, is to add more RAM. That is the absolute best way to speed this process up.

The servers that I've tested this advice on, with my configuration and this processing, span from fairly small servers to relatively large ones. I personally have no need to test on servers with more than 16 CPUs or 128 GB of RAM, so I haven't done much testing beyond there; for our processes, I've been able to get the timings that we need with that hardware or smaller. Of the little charts, I like this one because it shows the performance cliff in relationship to the size of the hardware and the size of the data. On US West, you can see that everything's fairly close; Rig B has four CPUs, Rig C has eight, and Rig D has 16, and on US West the timings are pretty close. But once you get up to North America, you see that green line shoot way up. North America's data is enough bigger than just US West that it exceeded a threshold on that hardware, and so it just falls off a cliff. You can see the same thing with Rig D between North America and Europe: it falls off a performance cliff where it takes more than twice the time of the next size up of hardware.

I/O is also a good thing. RAM is important for this process, but I/O is really good too. This screenshot I got shows osm2pgsql pushing almost 800 MB per second to disk, and I think if I could give it more, it probably would be able to push more.

On the thread of automation, I'm not going to go deep into this; I will provide this script on the page with the recording after it's all published. This script basically combines all of my steps into one process: it creates the database, it creates the extensions, it puts a folder in place and downloads the data into it, and then it has the osm2pgsql command that I need to run. This particular one doesn't have flat nodes included; I have another version that has flat nodes included for when I do want to use that option. But I don't run those individual commands; I run one script that wraps it all up into one.

And while I would love to get into the DBeaver demo today, I'm going to skip over that; it was really just a precursor of what's up next. The next session is really all going to be focused on querying the data, exploring the data, looking at the data, and what to do with it now. Thank you, everyone, for joining in. I hope you all have a fantastic rest of your Tuesday, and I look forward to seeing you next time.
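For reference, a wrapper along the lines of the one-script process described above might look like this sketch; it is not the script from the talk, and the region, cache value, and paths are illustrative assumptions:

```bash
#!/bin/bash
# Hypothetical wrapper combining the steps from this session.
set -e

REGION_URL=https://download.geofabrik.de/north-america/us-west-latest.osm.pbf
PBF=$(basename "$REGION_URL")
DB=pgosm

# Create the database and the required extensions.
createdb "$DB"
psql -d "$DB" -c "CREATE EXTENSION postgis; CREATE EXTENSION hstore;"

# Put a folder in place, download the region, and verify the checksum.
mkdir -p ~/osm && cd ~/osm
wget "$REGION_URL" "$REGION_URL.md5"
md5sum --check "$PBF.md5"

# Run the load (this version without --flat-nodes).
osm2pgsql --create --slim --drop \
    --cache 24000 --number-processes 8 \
    --hstore --multi-geometry \
    -d "$DB" "$PBF"
```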