I'm going to be doing a talk on enterprise data architecture with Postgres. Here's my email address, and I'll make sure the conference gets my slides. If you have questions down the road, don't hesitate to reach out, whether they come out of the conference or out of the trainings we do; we're open to questions in general going forward.

A little bit about ProPass and what led to this talk. ProPass started life as a consulting shop, and we've recently rebranded as ProPass, which stands for Professional Postgres as a Service. Our goal is to engage with customers at a service level rather than on an hourly basis, and one of the main vehicles we use to do that is what we call the monthly service. We build a monthly report, gathering the traditional metrics you would normally gather as a DBA, but we present that report in the context of data architecture and scalability, because for our clients that's the bottom line: as my business grows, is Postgres going to become a bottleneck for me? How am I going to scale? How do I make sure the database offers the same level of performance I'm getting today when my company is two, three, or five times bigger down the road? That's what we tailor the report for. We still talk about performance at the lowest levels, but we also talk about how we're going to scale and where the capacity points are at which we need to start making decisions about how to scale. That's our monthly service in a nutshell.

We also have a service called Monthly Service Plus, where we either host the database on one of our servers or engage with the client on one of theirs, and build on the monthly service to turn it into, for lack of a better term, an RDS experience: we simply take over all of the management. The client gets a win-win, because they get that RDS-style experience where they don't have to worry about the server at all, but we can also do anything we want for them on that server. If they have a reason to deploy, for example, Slony, and there are still some really good use cases for Slony these days, we can do so because we have access to the server. If we want to deploy some scripts that run on that server under cron, again, we can do that, whereas with RDS there is no access to the server. And finally, we do a fair amount of training and even some consulting beyond the monthly service, though that's becoming more and more one-off for us unless the work is standard.

So today, based on our experience with this monthly report and how that cycle operates, I'm going to talk a little bit about how we've helped some of our clients grow and scale with Postgres. I'm not going to get into any specific tool set or a magic formula, just some of the approaches that have worked really well for us.
We'll talk a little bit about the monthly report as a life cycle. Whether we're doing it, or whether you take this approach into the shop you work for or into your own consulting, I think it's a really good way to keep a company on track long term and focused on the big picture, as opposed to simply tuning the problem areas and partitioning tables when they get too big. All of that is good, but we couple it with an architectural, scalability-oriented view. We'll talk about some of the most common things we do for IO performance, when we use tablespaces, some high-level recommendations for indexing, and when we get into partitioning. Then we start isolating activity, we often bring in Pgpool-II and do load balancing, and then we start dealing with failover: failover scenarios within a region or a data center, and also failing over to a DR site. We've worked through a number of those issues, and sometimes we get to the place where BDR or sharding becomes necessary. In one case we deployed BDR so the client could, for lack of a better term, shard their connections: each server now has less activity on it, so we've scaled, and scaled successfully, but it's not sharding in the sense of this database holds customers one through a million and that database holds a million and one through two million. With BDR we're still replicating everything, so we could still end up with a bottleneck, but as long as we're careful about load balancing the connections it's a really easy way to scale in a pretty hefty manner. I'll wrap up with some thoughts on maintaining reliability, because sometimes we get in the door and a lot of the right things have been done in terms of scaling, but the implementation wasn't well thought out, it becomes a maintenance nightmare, and reliability goes down the tubes.

For the monthly report we generally use pgBadger and pgCluu for most of the low-level database analytics. We're probably a quarter away, maybe less, maybe two months, from having our own utility that gathers everything we like to see in the format we like to see it, so we can say: show me all of my graphs at five-minute intervals over the past 30 days or the past week, or roll it up to daily summaries and show me the last six months. We'll be able to do a lot with that, but it's basically the same metrics. We also tend to leverage whatever the client has already deployed for OS monitoring. A lot of our clients use Nagios, and right behind that the second most common is probably New Relic; both have Postgres plugins, so we pull those data points as well. Then we present the customer with a monthly DB review that covers both the tactical and the strategic side: a top-to-bottom database health check, if you will, that talks about low-level metrics as well as where we can go from here and how we scale, in one report, every month.
On the tactical side we're looking at the SQL queries themselves: which are the slowest-running queries, and which are actually worth spending time tuning. If a query runs for 30 seconds and sits at the top of the list but only runs once a day or once a week, it isn't really worth a lot of tuning effort; if a query runs 42,000 times a day and takes 30 seconds or more, that's the one we want to look at. We also look at checkpoints and checkpoint activity and try to make sure the database is well tuned from a checkpoint and background writer perspective. We look at temp file usage and vacuum operations, we look at waiting queries, and we actually go through the logs for errors and warnings. Then we couple all of that with memory, CPU, and IO information so we can line spikes up with events that were taking place in the database and figure out where to spend cycles doing classic DBA tuning.

On the strategic side we try to put together some very specific database capacity points: if we change nothing, at what point do we start having performance issues? That lets us talk about when it's going to become important to do X versus Y from an architectural perspective. As we move forward in time and the database scales, we can identify those capacity points as early warning indicators: when we hit this point, these three things need to be considered as scaling options, and we need to act within a certain timeframe before it becomes a crisis. We're thinking long term and fairly high level, but we're making sure that when the client gets to the place where capacity is being squeezed there are no surprises. They already know that at this point we either roll out a hot standby and load balance our reads, or we get reporting off of the main system; we've already talked through the options and we know roughly what the plan is going to be. We also talk about data management within that capacity framework: connection pooling and load balancing, isolating reporting-style queries within their own schemas early on rather than waiting until the last minute (which sometimes requires ETL processes and techniques), isolating activity in general, and at what point we get into real sharding scenarios and what the other scaling options are. So on the strategic side of the report we really zero in on capacity: when we get squeezed, what are our options, what are we going to do, and let's have a plan of attack months in advance so the client never ends up in a hair-on-fire, run-around-the-room situation.

That's what our life cycle looks like. We collect stats at five-minute intervals throughout the month, we parse the database logs with pgBadger, then we do the analysis, compare against the previous report, and put together the tactical and strategic planning.
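As a concrete illustration of that "worth tuning" ranking on the tactical side, this is the sort of query we might run, assuming the pg_stat_statements extension is installed; the column names below match the 9.x/10-era view (total_time, mean_time), which newer releases have since renamed.

    -- Rank statements by total time consumed, not by single-run duration:
    -- a 30-second query that runs 42,000 times a day matters far more
    -- than a 30-second query that runs once a week.
    SELECT calls,
           round(total_time::numeric, 1) AS total_ms,
           round(mean_time::numeric, 1)  AS avg_ms,
           left(query, 80)               AS query
    FROM   pg_stat_statements
    ORDER  BY total_time DESC
    LIMIT  20;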
Once we've got the tactical and strategic planning put together, we have a report to review, and we get together with the client to review and prioritize. We have to balance what we'd like to see done against what lines up well from a business perspective on the client side: they've got releases, they're doing development, they've got other things going on, and they have budget constraints, so we need to couple well with their processes. We get together once a month, look at priorities, and decide at what point we can do X versus Y. If we have to wait longer than the capacity points would dictate, then what's our backup strategy, what's plan B, and in some cases plan C? We also use the meeting to track progress: month to month, where were we, where are we headed, and what do the trends look like, primarily from a capacity and risk perspective. One risk item might be: this capacity point suggests that if we get to this many queries per second, we're going to have to load balance our reads, because 40% of our traffic is reads. We trend how close we are to that specific point each month, and of course we have a plan A and a plan B ready to address it once we hit it.
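Just as a rough illustration, here's one way to eyeball a read-versus-write mix like that 40% figure using the pg_stat_database system view; it counts rows rather than queries, so treat it as a sanity check, not the number we'd actually trend.

    -- Approximate read vs. write activity since the last stats reset.
    -- These are row counts, not query counts, so it's only a rough gauge.
    SELECT datname,
           tup_returned + tup_fetched               AS rows_read,
           tup_inserted + tup_updated + tup_deleted AS rows_written,
           round(100.0 * (tup_inserted + tup_updated + tup_deleted)
                 / NULLIF(tup_returned + tup_fetched
                          + tup_inserted + tup_updated + tup_deleted, 0), 1)
                                                    AS write_pct
    FROM   pg_stat_database
    WHERE  datname = current_database();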
One of the areas we get involved with straight away is IO performance, because Postgres, like any other relational database, is going to be IO-bound for the most part. From a tactical DBA perspective we recommend that clients focus on disk speed and RAM over CPU speed. Of course we want as many CPUs as we can get, but we don't necessarily need the fastest, most expensive ones; we just want lots of them. The more spindles we have, the better we can take advantage of parallelism in the database engine. In today's cloud environments, "spindles" is becoming a less concrete term, so RAID volumes are one really good way to get there. If the client is an Amazon shop using EBS volumes, we can build a RAID volume across several EBS volumes and get significantly better IO throughput than we would otherwise. What we find a lot of the time is that clients simply pay for more IO bandwidth, and that's not always the best plan; it gets awfully pricey with Amazon. Sometimes a RAID 0 or RAID 10 across multiple EBS volumes solves the IO problem, at least until we hit the next capacity point, without paying for more provisioned bandwidth. RAID 5 and RAID 50 we push back on pretty hard, because they give really poor write performance. For a read-only database that's fine, but we're almost never truly read-only; even if we're reading off a standby, the master is still doing the writes and we're right back to the RAID 5 issue.

Separating the WAL files, moving the pg_xlog directory onto a separate volume, is one of the best things we can do: it's dead easy, it isn't expensive, it doesn't take a lot of man-hours, and we get a significant IO boost right out of the box. Postgres writes to the write-ahead log for every change and flushes it to disk when a commit takes place, before the corresponding data pages ever have to be written out of the buffer pool, so by the nature of that process the WAL is the biggest IO bottleneck in the system. If we migrate it to a separate volume we sometimes see a 30 to 40% boost in IO performance right out of the gate, and if we can put the WAL on SSDs, that's even better; if we can do both, that's a huge win at very low cost to the client. We generally push for that from day one, because unless the database fits in memory today and we expect it still to fit in memory ten years from now, it's going to be a problem at some point.

Just a note about running Postgres on Windows: Postgres does have a Windows port, but Windows as an operating system assumes applications use a threaded model, and Postgres is a process-per-connection, forked model. Windows doesn't really have a native fork, so the port has to emulate it; that's why scaling Postgres on Windows is such a struggle. It's not a knock against Windows, it's just that the port is a straight port rather than a conversion to what Windows expects. So if you have the choice, and sometimes you don't, choose Linux over Windows; you'll get much better performance long term.

The next step, once we've dealt with the usual IO suspects, is usually tuning the checkpoints and the background writer, because those have a propensity to become the next big IO bottleneck as traffic scales significantly. I'm not going to go too deep, but the basic knobs are bgwriter_delay, bgwriter_lru_maxpages, and bgwriter_lru_multiplier, and tuning those lets us get checkpoints in particular to the point where they aren't flooding our IO channel. I've included some extra material on the LRU multiplier because there seems to be a fair amount of confusion about what it actually does, mostly among self-taught folks; we work with some really sharp DevOps people and developers, but they don't always understand what the multiplier is really for. When we tune checkpoints we watch the pg_stat_bgwriter view and look at how often the background writer writes its maximum number of pages and goes back to sleep even though there were more dirty pages in the buffer pool it could have written. If we need to tune the system, the first thing we do is bump up bgwriter_lru_maxpages, which defaults to 100; then if we need to we'll tweak the multiplier; and only third will we touch the delay. We try not to touch the delay unless we have to, because that 200-millisecond pause between rounds is really key to keeping checkpoints from becoming an IO bottleneck.
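To make that concrete, here's a sketch of what we're watching for and the first knobs we'd turn, assuming a 9.x-era pg_stat_bgwriter layout (newer releases have reshuffled this view); the ALTER SYSTEM values are purely illustrative starting points, not recommendations.

    -- Who is actually writing the dirty pages?
    SELECT checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,  -- written during checkpoints
           buffers_clean,       -- written by the background writer
           maxwritten_clean,    -- times the bgwriter hit bgwriter_lru_maxpages and stopped
           buffers_backend      -- written by ordinary backends (the worst case)
    FROM   pg_stat_bgwriter;

    -- If maxwritten_clean keeps climbing, let the background writer do more per round.
    ALTER SYSTEM SET bgwriter_lru_maxpages = 400;    -- default 100
    ALTER SYSTEM SET bgwriter_lru_multiplier = 3.0;  -- default 2.0
    SELECT pg_reload_conf();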
The real goal with checkpoints and the background writer is to make the background writer just aggressive enough that when a checkpoint does happen, there's not all that much left for it to do. A checkpoint is by nature a foreground process, which means it can effectively block other work, not by taking locks or by making queries wait on it directly, but simply through resource usage. Given its priority, a checkpoint has the capability to make the database seem to come to a halt every five or ten minutes, or however often checkpoint_timeout has them running. The database, like any relational database, needs a recent point in its transaction timeline where it knows that every page that was dirty in the buffer pool has been fully synced to disk, so checkpoints are a necessary evil. The background writer lets the database sync a lot of those pages as a true background process at lower priority, so that when it's time to run a checkpoint there's simply less work to do. It is a touchy balance, because if we're not careful we just migrate the IO problem from the checkpoint to the background writer, so we have to be very careful how we manage it. Basically it comes down to monitor, test, and tune: don't change 32 parameters at once, change things very carefully, monitor very carefully, and make sure we're moving the IO problem in the right direction. And do it early enough that it's not a crisis, so we can run multiple cycles of testing and tuning without the client breathing down our necks saying, oh my god, what are we going to do?

There are also a few newer parameters, introduced around 9.6, that force the database to flush these pages out of the OS cache onto disk while the processes are executing. We assign a size value to all three: one for the background writer, one for the checkpointer, and one for ordinary backends. The backend one applies when, say, I run a large update over millions of rows in a big table; it has to pull all those pages into memory, the buffer pool isn't big enough, so at query time it forces other pages out, and backend_flush_after dictates at what point those writes back to the OS get flushed to disk. Again, the idea is to avoid setting ourselves up for an IO bottleneck.
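A minimal sketch of those flush settings, assuming a 9.6-or-newer server on Linux (elsewhere they have little or no effect); the sizes are illustrative only, and the shipped defaults are already sensible for a lot of systems.

    -- Ask the kernel to start writing accumulated dirty data back to disk
    -- once this much has built up, rather than letting it pile up for one
    -- giant flush at fsync time.
    ALTER SYSTEM SET checkpoint_flush_after = '256kB';  -- writes done by the checkpointer
    ALTER SYSTEM SET bgwriter_flush_after   = '512kB';  -- writes done by the background writer
    ALTER SYSTEM SET backend_flush_after    = '1MB';    -- writes forced by ordinary backends
    SELECT pg_reload_conf();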
The last IO item we look at regularly among the usual suspects is effective_io_concurrency. The documentation recommends, for example, that on a RAID 10 volume you set it to one for every disk in the array. The setting only affects bitmap heap scans, but bitmap heap scans turn out to be a very common pattern when we're looking at explain plans. What we've found is that the right value comes down less to how many disks you have and more to what your IO bandwidth looks like and how much parallelism that bandwidth can support. We've had scenarios where we took it from effectively off up to 100 or 150 on a RAID volume of SSDs and saw a dramatic impact on specific queries, going from 60-second queries down to milliseconds. Given that it only affects bitmap heap scans, we need to understand why we're bumping it up; usually it's some function reading a view that's nested on top of other views, so we're seeing multiple subqueries, since Postgres expands views as subqueries. In those cases we're usually tuning a very specific situation, maybe a handful of queries or two or three functions, and this is a really good parameter to look at.

As we start to tune a system from a tactical perspective we also get into tablespaces, indexing, partitioning, and so forth. I'll be brief on tablespaces, but basically we want to isolate our high-use objects, our large objects, and sometimes just our indexes, all of which can help the database scale. When it comes to indexing, the goal is simply to tailor the indexes to line up with the usage patterns. We see time and time again that clients have reasonable indexing, but it isn't lined up with how the tables are actually used. For example, leveraging partial indexes when we're never going to query more than the last 30 days of a table that holds 15 years of data and four billion rows; a partial index can be a real boon there, because most queries are now walking an index tree that only holds 30 days' worth of leaf nodes. Or, if we're always applying an upper() or lower() or some typecast, we can use an expression index and line the index up with exactly how the column is being used.
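Two quick sketches of what lining an index up with the usage pattern looks like; the table and column names are made up for illustration, and the date cutoff in the partial index has to be a constant, so in practice it gets recreated periodically rather than written as a moving "last 30 days".

    -- Partial index: queries only ever touch recent rows, so the index
    -- stays small even though the table holds years of history.
    CREATE INDEX idx_events_recent
        ON events (customer_id, created_at)
        WHERE created_at >= DATE '2024-01-01';

    -- Expression index: if queries always filter on lower(email), index
    -- exactly that expression so the comparison can use the index.
    CREATE INDEX idx_customers_email_lower
        ON customers (lower(email));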
When we do partitioning, it's generally because we want partition elimination, we're going for performance, but we also keep in mind that partitioning is one of the best ways to roll off historical data without having to do mass deletes. One approach we like, when we write the insert function, is to use this style rather than the usual if/then/else business. The slides include a real working example, but the idea is: the function creates the partition table if it doesn't exist, and then there's a single insert statement that uses dynamic DDL to generate the table name for whichever partition the row needs to land in. That way we don't need external processes making sure partitions exist as we move forward in time, and we don't have to come back and modify the insert function every time we add a partition, because there are no if/then/else checks. It simply generates the table name from the incoming data and uses that in the insert. We can't always do this; it depends on the nature of the data and what the partition key is. But when we can, it's a much better approach. The only IF statement in the function is "if the table doesn't exist, create it"; beyond that it does nothing more than format the insert.
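The slides have the real working code; what follows is only a stripped-down sketch of the same idea, assuming monthly range partitions on a created_at column, made-up table names, and the classic inheritance-style partitioning we're describing here (declarative partitioning in Postgres 10 and later makes much of this unnecessary).

    CREATE OR REPLACE FUNCTION events_insert_trigger()
    RETURNS trigger
    LANGUAGE plpgsql
    AS $$
    DECLARE
        part_name  text := 'events_' || to_char(NEW.created_at, 'YYYYMM');
        part_start date := date_trunc('month', NEW.created_at)::date;
    BEGIN
        -- The only IF in the function: create the month's partition on first
        -- use instead of relying on an external job to pre-create partitions.
        IF NOT EXISTS (SELECT 1 FROM pg_class WHERE relname = part_name) THEN
            EXECUTE format(
                'CREATE TABLE IF NOT EXISTS %I (
                     CHECK (created_at >= %L AND created_at < %L)
                 ) INHERITS (events)',
                part_name, part_start, part_start + interval '1 month');
            -- Index creation on the new child is omitted for brevity.
        END IF;

        -- One dynamic insert; no if/then/else chain to maintain as partitions grow.
        EXECUTE format('INSERT INTO %I SELECT ($1).*', part_name) USING NEW;

        RETURN NULL;  -- the row has been routed to the child, skip the parent insert
    END;
    $$;

    CREATE TRIGGER events_partition_insert
        BEFORE INSERT ON events
        FOR EACH ROW EXECUTE PROCEDURE events_insert_trigger();

With native declarative partitioning the whole trigger goes away, but on the 9.x systems we're talking about, this keeps partition maintenance out of cron and out of the function body.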
Next, as we migrate from the tactical into the strategic areas, we start thinking about isolating our high-use tables and indexes across tablespaces, and about migrating reporting to a hot standby. One thing we look at with hot standbys is that they're not just for failover and HA: if we can migrate read-only operations over to the standby, we can cut the load on the primary significantly. In some cases we want to migrate reporting or some set of business operations to a separate cluster entirely, and then we might use some sort of ETL operation. A classic example is feeding a data warehouse, and it's one of the classic remaining uses of Slony: we fire up Slony and replicate only the tables that are the DSS base tables, the base layer for the warehouse, and then we modify those replicated tables so they're allowed to fire triggers and attach an audit function. When it comes time to run the ETL process, rolling the data up into aggregations and summaries and building the conformed dimensions, the diff is handed to us on a silver platter: we walk the audit table and know exactly what to do in terms of inserts, updates, and deletes, slowly changing dimensions and all. And to boot, we can do all that aggregation on the replica node, because with Slony only the replicated tables and sequences are read-only, not the entire cluster.

The next step is often some sort of load balancing, and hand in hand with that we start talking about failover and the whole HA scenario. Pgpool-II is a tool we use very, very often. It's not that we're necessarily partial to pgpool over, say, PgBouncer as a connection pooler; it's that its feature set lines up use-case-wise with a large number of our clients as they begin to scale. Pgpool supports both streaming replication and Slony on the back end, it's easy to configure, and it has a number of heartbeat and failover options. In most cases you configure it, write a simple script, and pgpool passes that script the parameters it needs to know which node to reach out to when it executes, for example, the promote command on a hot standby. Pgpool will also always treat the first writable node in its node list as the master, so if I bring another node back online as a standby after a failure, pgpool will generally start load balancing across it automatically, because it's in the list and it isn't the first writable node it found. That makes it very easy to survive node failures and reintroduce nodes without impacting the applications. Pgpool also does formal load balancing for us. It uses a blacklist and a whitelist of function names, because function calls go through the keyword SELECT, which normally indicates a read; we can blacklist the functions that write and still rely on pgpool's load balancing. We've done that a number of times and it's pretty seamless.

Once we get beyond that, we reach the place where a single master with standbys of some fashion just doesn't cut it anymore, and we need to start thinking about things like BDR and sharding. I like to go to BDR rather than sharding when we can. BDR, for anyone who doesn't know, is a third-party extension for Postgres that does bi-directional replication across up to 48 nodes. In one case we deployed it with a node for each region of the U.S., a Western region, a Southern region, the Midwest, and the East, for a large nationwide mortgage and car-loan company. They had pretty hefty bandwidth, but they really needed to isolate the traffic hitting each node. As long as each connection is generally autonomous, meaning I don't need an update to be instantaneously visible to another transaction on another node, BDR is a great fit, because BDR replicates asynchronously, just like streaming replication. As long as your connections are autonomous in that sense, it's a really nice way to scale.

In some cases that isn't doable, and then we need to look at true sharding. That's sort of the last stop before you get into a Postgres-XL or CitusDB scenario: put these customers on this node and those customers on another node, or shard by business silo, so these business operations live on this node and those business operations live on another. For example, we have a client that's sort of like cars.com, not actually cars.com, but that's kind of their thing, and they're planning to isolate by region, so they really are sharding, but sharding based on where the incoming connection comes from rather than classic sharding by something like customer ID. One thing we've talked about with them is sharding by business operation instead. They manage the transaction end to end when it comes to buying or selling a car, including the financing, so the financing could easily live on a separate cluster, and the selling, buying, browsing, and all the customer interaction up to the financing point could live on another, and on current projections they'd be able to scale that way for quite some time rather than truly trying to shard across regions. At some point they may be a good BDR fit, but that's yet to be seen.
And when we do start doing true, classic sharding, we need to start thinking about connection methods and, like I mentioned, isolating business silos or business functionality, but basically about how we're going to redirect incoming connections efficiently, ideally without the classic pattern of hitting the database, running a query to figure out where to connect, and then making the real connection. There are a number of ways to do that, and I won't get into all of them, but at that level of scaling those are typically the things we start looking at.

Finally, in terms of managing reliability, we usually get into the whole DR-site scenario. In a number of cases we'll need to manage replica nodes across a region or across data centers, i.e., rolling into a DR site. Cascading replication and synchronous replication can both be helpful there. We'll do synchronous replication into the DR site, so if we ever truly need to fail over to it we're not going to lose any transactions, because that synchronous replica won't let control come back from our commit statement until the standby has actually received the update. It doesn't have to have applied it locally, it just has to have received it. Cascading replication is daisy-chaining our standbys, so we can have a scenario where we promote one of the nodes in the DR site and the secondary node there is already following it as its master; right out of the gate we have a master and a standby in the DR site. The repmgr utility helps us do effectively the same thing: we can tell standbys which node to follow with a follow command, so we don't have to rebuild those standbys to point at the right master. Leveraging pgpool is very helpful even across a DR site or across regions, and to make sure pgpool isn't the new single point of failure in the architecture, we deploy the pgpool watchdog, a feature that lets us fire up a secondary, or even multiple, pgpool instances that watch each other; if one goes down, it can migrate an elastic IP or virtual IP for us so pgpool never becomes that single point of failure. And then finally, BDR and application connection management: if we're going to isolate connections and load balance that traffic across nodes, or if we're sharding, we have to think about how we're going to manage those incoming connections in a way that's actually maintainable long term, so it doesn't become a maintenance nightmare. That's usually one of the tricks: thinking about what it's going to take to maintain these approaches and implementations as we move forward.

So I know I moved quickly, but I wanted to get through everything and wrap up so we'd have a little time for questions. I appreciate everybody coming, and hopefully this was what you were looking for. Any questions? Nada? Well, thanks everybody, I appreciate it.