Awesome, thank you. You'll have to bear with me a little bit: on the flight over I managed to catch a cold, so between the jet lag and the cold I'll do my best here.

Really quick show of hands: does anyone here not use Postgres? A few hands. Okay, hopefully I can convince the rest of you by the end. You're all my people, so I'll talk to the rest of you.

So really quickly, who am I? I work at Citus Data. We turn Postgres into a distributed, horizontally scalable database. Think of it as sharding without all the work: scalable like NoSQL, with all the goodness of Postgres. I curate Postgres Weekly, so if you've subscribed to something like Python Weekly or PyCoder's Weekly, Postgres Weekly is a similar newsletter, focused more on app dev than on DBA work; hopefully you can find something you like in there. I blog a good bit about tech, startups, and Postgres as well. Previously I was at Heroku and launched the Python support there, so Python is definitely my world when it's not Postgres.

Today we're going to walk through different stages of how to deal with data at any size, and what things you should be doing and caring about at each one. I don't know how many of you run databases that are around one terabyte; Postgres can handle that pretty well. Most people are probably more in the small-to-medium category, but there are some things you can do at each of those stages that make life a little bit easier if you spend the time investing in them.

So, starting a little bit small. Before I dig into that: why Postgres? As a co-worker said, it's the Emacs of databases, and I think that's a really telling comment. I'm a vi guy, but the sentiment makes a lot of sense. Emacs is as much a platform as it is a text editor, and Postgres is as much a data platform as it is a relational database. People used to think of it as being really, really rigid; now it has a lot more in the box. If you look at it, there are a lot of really rich data types; it probably has more data types than any other database I know of. Rich geospatial support: it's well regarded as the most advanced open source geospatial database. Index types: we'll get to this a little later, but Postgres adds a new index type almost every year. They have a deep, somewhat academic foundation, but they can have really practical use cases. Full-text search. JSONB deserves its own call-out: Postgres got JSON support in 9.2, which is now six years ago, but that was just text validation. JSONB is binary JSON in your database, so think Mongo-like binary JSON documents that you can index, search, and query on, so you don't have to have everything as normalized tables. And then extensions. Extensions are really where some of that "it's the Emacs of databases" comes in: your Postgres can get new features without having to wait for a year-long release cycle. A lot of this stuff, like PostGIS and full-text search, comes in as part of that extension framework, with really low-level APIs that let you extend things.

All right, so when you're small, what do you need to worry about?
The first thing, I would say, is to pay attention to data types. I talk to a lot of people who say, "I want to use the most generic thing possible, in case I want to up and move from Postgres to MySQL, or Postgres to Oracle, or something else." The reality is that you're then limiting yourself really heavily to the worst common denominator of everything shared across databases. SQLite is a great example of this: it lists null as a data type, not as a value, and I'm not even sure what that means in a database sense. Or take MySQL: if you want to treat text the same way Postgres does, well, when you insert something too long MySQL just truncates it. It doesn't say that's invalid; it just truncates it. I care a little bit more about the quality and sanity of my data. So if I want something to be explicitly, say, 140 characters long, like a tweet (well, not so valid today), I want to use a varchar(255) or varchar(140) there.

Timestamp versus timestamp with time zone: dates are hard. There was a great blog post, I think a month ago, that hit Hacker News, by Zach Holman. It was a fun read; if you haven't seen it, go read it. Dates, times, time zones, date math, all of that is very hard, and Postgres is really good at it. Timestamp with time zone (timestamptz) is probably what you want by default: as data comes in, it can convert the time zone for you. It still stores everything under the covers as the equivalent of UTC, but it does that math for you, which is really nice.

JSON versus JSONB: Postgres got JSON support in 9.2 and 9.4. In 9.2 we got what I would call plain JSON support: it was just text that was validated as JSON as it came in, but it was stored as text under the covers. JSONB, which came in 9.4, is a binary representation, which is really fast, and you can index it and all of that. If you're using something like a logging service, or recording raw inputs from an API, and you want to preserve whitespace, use json; there's a valid case for it to still exist, but that's about the only case.

So the first thing is: leverage your data types. If you have something in your application code as a certain data type, look at using the same one in your database.

All right, constraints. This is a thing I find most people don't intentionally put in; they just assume the application is going to take care of it. The reality is that without database constraints you can definitely end up with invalid data in your database. Things like limiting the length of strings: if you really want that, you should have the same constraint within your database. Postgres has things like the inet data type for IP addresses, which checks whether a value is actually a valid IP address or not. Foreign keys are great, so that if you delete a user, you're also deleting all of that user's data. That's more of a concern now with GDPR and those sorts of things, so making sure you actually have foreign keys, so that as you delete something it cascades, is really important. This is a lot easier to do when you first start. When you first start, you're probably thinking, "I just want to build something fast," but putting this in place early makes all the difference. Coming back and adding it years down the line is a lot of work, and at that point you realize you actually have to go clean up data too.
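To make that concrete, here's a hypothetical sketch that pulls those ideas together: explicit lengths, timestamptz, jsonb, inet, a check constraint, and a foreign key that cascades deletes. The table and column names are made up for illustration, not from the talk.

```sql
CREATE TABLE users (
    id         bigserial PRIMARY KEY,
    email      varchar(255) NOT NULL UNIQUE,
    bio        varchar(140),
    last_ip    inet,                               -- rejects values that aren't valid IP addresses
    settings   jsonb,
    created_at timestamptz NOT NULL DEFAULT now()  -- stored as UTC, converted on the way in and out
);

CREATE TABLE posts (
    id         bigserial PRIMARY KEY,
    user_id    bigint NOT NULL REFERENCES users (id) ON DELETE CASCADE,
    body       text CHECK (char_length(body) <= 10000),
    created_at timestamptz NOT NULL DEFAULT now()
);
```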
There's one pattern that's really common in the Rails community, and I don't see it as much in the Python community, but I think it's a really good one: soft deletes. In Rails they don't actually delete data, which I know is probably a concern now for privacy and other things, but basically there's a deleted_at column on every model they create. They just set that value, and then they filter it out constantly: where deleted_at is not null, the row isn't returned by the query. This is really, really nice if you've ever accidentally run a query, which of course none of us ever have, like DELETE FROM users where you forgot the WHERE clause. Everyone remembers that moment: it isn't even a full second after you hit enter that you're trying to cancel it and asking what just happened. With soft deletes it's still a pain to clean up, with an UPDATE to reset deleted_at, but that's far, far better than the case where you actually deleted the data and have to do a point-in-time recovery against your database. So soft deletes are a pattern I really like and would like to see more of; I'd encourage you to consider whether they can work for you.

The other thing I would say is: spend some time mastering your tools. How many here have a .bashrc or a .vimrc set up? Okay, most hands. How many have a .psqlrc set up? That's actually more than I expected, maybe 10 to 15 or so. psql is a really great Postgres editor; it's the CLI tool that ships with Postgres. Out of the box it's okay, but you can customize it, and it has a lot of built-in richness. You can add things like tab completion. \x auto is an awesome feature that, based on the width of your screen and the width of the query output, auto-formats the output so that it either word-wraps or prints an entire record line by line. One thing I like to do (if you search for psql on my blog you'll find it) is save the history of every command I run against a database, with a separate history file per database name. If you've ever had someone come to you and say, "hey, can you run this report for me?", you hop into the database, run a query, send the report, and you're done. Three months later they come back and say, "hey, remember that report you ran for me? Can I get an updated copy?" I have no idea what they asked for or what I ran; I go recreate it, and it's probably not exactly the same thing. By saving the history of every query I run, it's easy to come back through it for that kind of ad hoc analysis, and I don't have to do anything extra beyond setting up the psql history file. \timing is another nice one: it shows, for every query you run, how long it took.
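For reference, a minimal ~/.psqlrc along those lines might look roughly like this; the exact settings are a matter of taste, and this is just one sketch.

```
\set QUIET 1
-- keep a separate history file per database, so old ad hoc queries are easy to find again
\set HISTFILE ~/.psql_history- :DBNAME
-- show how long every query took
\timing
-- let psql switch between normal and expanded output based on terminal and result width
\x auto
\set QUIET 0
```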
All right, so: backups. Who here has backups on their database? Who here has tested them within the last month? All right. Backups do not exist unless you test them. Setting up backups from the start is good, but if you haven't tested them, then I know you think they work, but you don't know. People saw the GitLab incident from about a year ago now: they had not just five different backups, they had five different methods of backups, and it was failure, failure, failure. Start a process of testing your backups. It doesn't have to be daily; weekly or monthly is probably a good interval. I would go and do it now: try to restore your latest backup and let me know whether it failed or succeeded. If you're not testing them, there's almost no point in putting them in place, because they're probably not going to work.

So early on, I would say, invest in those things. That's probably under 10 gigabytes of data, which covers a lot of small projects, a lot of things that aren't going to have a lot of users or a lot of data.

For medium-sized data, you now need to start learning a little bit more about Postgres. You don't have to become a DBA or an expert, but a few key things will go a long way. Here we're looking at something from roughly 10 gigabytes of data up to 100 gigabytes or so. A few basic things you're going to want to do: set up logging, and make sure you have log rotation. The number of times I've seen people set up logs, not set up log rotation, and then run out of space on their database purely because of logs... it happens, so make sure you handle these sorts of things. Configure your memory: Postgres isn't great by default at its memory settings, but it's not that hard to configure. Tweak vacuum, adjust your checkpoints. The biggest thing here, though, is: don't try to become an expert in all of those settings. There are tools like pgtune where you can put in some data about your database ("I have this much data, this size instance") and it tells you what all of these settings should be. This is one area where I'd say don't try to learn too much; there are around 200 different settings and configs within Postgres. Just set them once and then you're pretty much good. There are a few talks out there on this as well; a great one from PyCon, I think two years ago, is "PostgreSQL when it's not your day job" by Christophe Pettus. He walks through a bunch of "set this config, don't worry about what it means; if you're on AWS, set this one, don't worry about what it means" to get you to a good, solid setup.

All right. The number one thing I pay attention to when I first look at a database is the cache hit ratio. (Is anyone else excited that DuckTales has come back? I'm way too excited about this, with my kids. Anyway.) Running one simple query against your database will tell you the cache hit ratio. Postgres keeps a lot of statistics about the queries it runs, such as which things were served from memory and which things went to disk, and as we know, going to disk is much, much slower than memory; you're looking at a hundred-to-one or a thousand-to-one difference in performance. That super simple query gives you back a single ratio, and what I want to see in most applications (this is for a transactional web app; if you're doing data warehousing it's completely different) is a cache hit ratio of 99%. If it's too low, the easiest way to improve performance is to go get a bigger instance: upgrade your database, add more memory to the box, and things will be happier. When you go from 99% down to something like 98% or 95%, performance doesn't go from 1 millisecond to 2; it goes from 1 millisecond to something like 100 milliseconds. It falls off a cliff really quickly. So querying this on a weekly or monthly basis is a good thing to do: send it up with your regular alerts, check it, and see when you need to upgrade.
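One common formulation of that cache hit ratio query, using the pg_statio_user_tables statistics view, looks roughly like this (treat it as a sketch; there are several variants floating around):

```sql
SELECT sum(heap_blks_read) AS heap_read,
       sum(heap_blks_hit)  AS heap_hit,
       sum(heap_blks_hit)::float
         / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS cache_hit_ratio
FROM pg_statio_user_tables;
```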
The next thing I look at is whether I'm properly using indexes. Just as Postgres tracks how much is served from cache versus from disk, it tracks an index hit rate. The query for this gives us something we have to think through a little more: it shows, per table, the percentage of the time my queries used an index, and the number of rows in the table. As a rough rule of thumb, if I've got more than 10,000 rows and I'm doing a lot of really short lookups, I want an index hit rate of 95% or higher. This varies by application; sometimes you may want it a little lower if you're doing a lot of reporting against one table. But for single-row inserts, lookups, and updates, because you're touching a single row, you want it either in cache or reached through an index. So here you're looking for a higher index hit rate, and in the example output the top three tables are ones where we'd probably want to come in and add some indexes.

All right, looking at it from the other perspective: you've got a certain page in your application, you want to improve its performance, you know it's slow, you don't know why, and you can see the queries it's running. Postgres has something called an explain plan. If you put EXPLAIN in front of any query and run it, Postgres tells you what it thinks it's going to do from a performance perspective. What you usually want is EXPLAIN ANALYZE: that tells you what it thinks it's going to do, and it also runs the query and reports what actually happened. If you have a really, really horribly performing query, be careful with EXPLAIN ANALYZE, because it is going to actually run that query. I've worked with Postgres for a little over 10 years now, and I still look at explain plans and have to pause, get some coffee, and work out what's going on. It shouldn't be as confusing as it is.

Some rough guidelines: for a web application, I'm thinking about whether I can get really common page response times down nice and low; with things like single-page apps the overall budget is more like one second, but common queries in my application I still want at 10 milliseconds or less, and rarer queries under 100 milliseconds.

Coming back to the example query, there are a couple of things to notice. In red there are two different values: the "actual time" is two numbers separated by two dots, and I want to pay attention to the second one. That's what the step actually ran as; in this case it's 295 milliseconds to complete that step of the operation. Knowing that I want things around a millisecond, I can see that the total runtime is slow, and that the one operation happening right there is a sequential scan, not an index. So what do I want to do? Add an index. In this case I'm just going to add it on salary, because that's the filter there. Rerun it, and that brings it down to 1.7 milliseconds.
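Here's a sketch of both pieces: a per-table index usage check against pg_stat_user_tables, and the EXPLAIN ANALYZE loop on a hypothetical employees table with a salary filter. The table, column, and threshold are invented for illustration.

```sql
-- Rough per-table index hit rate: what share of scans used an index, and roughly how many rows there are.
SELECT relname,
       100 * idx_scan / nullif(seq_scan + idx_scan, 0) AS pct_scans_using_index,
       n_live_tup AS approximate_rows
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;

-- The EXPLAIN ANALYZE workflow: inspect the plan, add an index on the filtered column, re-check.
EXPLAIN ANALYZE SELECT * FROM employees WHERE salary >= 50000;
CREATE INDEX idx_employees_salary ON employees (salary);
EXPLAIN ANALYZE SELECT * FROM employees WHERE salary >= 50000;
```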
There's a tool, if you search for "understanding explain", by depesz; I don't actually know how you say his name, I just know him from the internet. It's great: you take an explain plan, paste it in, and it highlights the really bad steps for you. It calls out basically where you need to optimize, and within which step. That helps when you know you've got a specific query.

A lot of times, though, you just have an application and you know things are slow. Can you tell me which queries are slow across the whole thing? I have all sorts of things being executed by my ORM, and I have no idea what the actual queries are. Postgres has an extension called pg_stat_statements. It records every query that's run against your database and essentially parameterizes it. So the query that was actually run might have been SELECT * FROM users WHERE email = 'craig@citusdata.com', and it strips out the 'craig@citusdata.com', because that part is different for every user in my application. It records all sorts of internal details, like how many blocks were dirtied and how many buffers were written; I don't care about most of that, though you can drill in if you want. What I really want to see is: how many times was a query run, how long did it take on average, and how much time is it consuming in aggregate against my system?

One query against pg_stat_statements gives you something like this: that SELECT id FROM users was run a ton of times, with a small average time in milliseconds, and in aggregate it has run for 295 seconds against my database since I started recording stats. There's another query, some other SELECT *, that averages 80 milliseconds and has consumed almost as much total time as the first one. Coming back and doing some of that math: if I know I can typically get a query down to about a millisecond, and I go optimize that second one, I get roughly two orders of magnitude back in overall performance against my database. So it's really handy: without having to dig into the application code, you can see which queries are being run, from where, and decide which ones to optimize.
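A rough version of that pg_stat_statements query is below. The extension has to be enabled first (shared_preload_libraries plus CREATE EXTENSION pg_stat_statements), and the timing column is named total_time on older releases and total_exec_time from Postgres 13 on, so adjust accordingly.

```sql
SELECT query,
       calls,
       total_time / calls AS avg_time_ms,   -- use total_exec_time on Postgres 13+
       total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```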
All right. I mentioned indexes as one reason Postgres is great, and Postgres has a lot of index types. With most databases, if you have a CS degree, you probably learned about a B-tree index in school. You probably didn't learn about the rest of these: GIN, GiST, SP-GiST, BRIN. Postgres pretty much gets a new one every year. There's a group within the Postgres community known as "the Russians"; one of them is a professor of astrophysics at the University of Moscow, and for fun he hacks on Postgres. We have different definitions of fun. They usually show up with some crazy academic paper saying, "here's this new index type I read a paper on", something like space-partitioned GiST (a generalized search tree). They show up on the mailing list saying "we're going to implement this", people say "that sounds crazy", they go into a cave, come back three months later with code, and everyone says, "okay, that looks good, let's commit it." But if you read through each of those, even in the Postgres docs, and ask which one you should use: I read through them, I look at the papers, and I kind of understand what's going on, maybe a little bit, but how to actually put them into practice is a completely different story.

So this is slightly oversimplified, but it will generally work 99% of the time. B-tree: this is usually what you want when you're doing a lookup based on, say, an email address, or a filter where salary is less than or greater than some value. GIN is useful when there are multiple values within a single column: think of a single column holding an array, an hstore (a key-value data type in Postgres), or JSONB, where you've got multiple different values inside that JSON document. GiST, the generalized search tree, is a little more confusing. You can think of it as being for when values overlap between rows. Think of shapes: you've got a polygon or a circle in one row and another circle in another row; some of the area of those shapes overlaps and some doesn't. Same thing with full-text search: if you've got a full sentence, some of the words overlap with words in other sentences and some don't. So if you've got values that overlap between rows, GiST is probably what you want; the simplified version is that if you're doing geospatial stuff or full-text search, GiST is probably the one. SP-GiST and BRIN are generally for really, really large tables. SP-GiST: I've asked a lot of the Postgres core community to simplify it for me, and the only thing I've taken away is that if you're working with phone numbers, you want SP-GiST; that's all I've understood so far. BRIN is really good when you're scanning a lot of sequential data, and I mean a lot: tables with billions and billions of records, often time series, that sort of thing. There's also another index type supposedly coming. I mentioned the Russians are the ones who have contributed most of these; they felt that because we had GIN within Postgres, we also needed VODKA. I wish that were only a joke, but that really is the working name, so apparently we're getting VODKA in Postgres as well.

A few other index tips. Be specific with your indexes; don't just say "create an index". You can add composite indexes: if you always query on first name and last name together when searching, add an index on both; Postgres is pretty smart about using composite indexes. Functional ones: if you always query on the lowercase of a name because you want a case-insensitive search, then if you're sending in the lowercase version from the application, make sure you do the same thing within your database index. And conditional (partial) ones, for when you have a lot of data but usually only search a subset of it. A great example: you're a phone book, and the database actually holds every historical phone number and every place everyone has lived. You're probably not frequently searching on where someone lived five apartments ago, so you can say WHERE current_address = true and it will only index the rows where current_address is true. Really nice if you only want an index on part of your data. The place I'd look for these is: do you have ten times the data, but you're commonly only filtering on a subset of it?
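Hypothetical sketches of those three styles, with invented table and column names:

```sql
-- Composite index for queries that always filter on both names
CREATE INDEX idx_users_name ON users (last_name, first_name);

-- Functional index for case-insensitive lookups (queries must filter on lower(email) too)
CREATE INDEX idx_users_email_lower ON users (lower(email));

-- Partial (conditional) index covering only the rows you actually query
CREATE INDEX idx_addresses_current ON addresses (person_id) WHERE current_address = true;
```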
Within your database you usually have a big trade-off between faster reads and faster writes. I know the answer is that you want everything to be fast; sorry, it doesn't work that way. A few months ago I posted a tweet with a query whose output is the statements to create an index on every column of every table in your database, so you'd have indexes on everything and it would be fast on the read side. This was mostly a joke, but it's also not a completely crazy idea. Without extra indexes, for every write you make, Postgres figures out the query plan, writes to disk, waits for the write to be acknowledged, and returns; that's roughly a one-millisecond round trip on a reasonable disk. As you add an index, it also has to update that index, check that there's no constraint violation, make sure the index is up to date, and then return, so roughly speaking you've added a couple of extra milliseconds per write. Add another index and it has to do the same work again, and another, and so on. If I go and add an index on every single column of every single table, now I have writes that take something like two seconds. Not the best idea.

But as I mentioned, Postgres is really good about keeping stats, and at that medium scale it's not terrible to over-index things. I wouldn't go all out and index every column of every table, but you can be pretty generous with your indexes, because Postgres has another super simple query you can run that shows all of the unused indexes in your database. If you have an index that's sitting there just slowing down writes, maybe because your application code changed or the data distribution changed and it never gets used, you can run this query, say "I don't need this", and come back and drop it later. It's a good way to start the process of cleaning up unused indexes. It gives you something like: how large is that index, is it really large or small, and how many times has it been used. If you see a zero there on index scans, go ahead and drop it; it's not adding value anywhere in your application.
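That check comes from the index statistics Postgres keeps; a rough sketch of such a query against pg_stat_user_indexes:

```sql
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan     AS times_used
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```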
So, on database migrations. NOT NULL is a good thing, right? We want a default value, and we want it enforced, except when you're running a database migration. In Postgres, if you add a new column and add a default value, it's going to rewrite that entire table: it makes a copy under the covers and writes that new value out, which means it holds a lock against incoming writes while it rewrites the table. On staging this is going to be really fast; you write things out and it takes a few seconds, even if staging has a gigabyte in the table. In production, when you run this on a hundred-gigabyte table, you let it run for a few seconds, a few minutes, five to ten minutes, and you start wondering when it's going to complete, and an hour in you've got people standing at your desk asking what's going on and when you'll be back up.

The better way to do migrations at any large scale is simply to break it into three steps. First, allow nulls and set a default value, at the database level and at the application level, so all new data and new updates get that default value. Second, gradually come back and backfill the old data: all of those hundred gigabytes of records that don't have the value yet get updated by a background job at night, a thousand records at a time, that sort of thing. Then, third, come back and add your constraint. You can do this on terabyte-sized tables now with pretty much no downtime; you'll maybe have a few seconds of locked writes, but overall it's much, much safer, and you won't find yourself bringing down production. I think everyone I've seen who is comfortable working with small databases, and then things get bigger and bigger, runs into this the first time they work with a larger database of 100 gigabytes or so, and it's much more painful the larger you get, so be careful with this one.

All right, so that was the medium scale, which I'd classify as somewhere from a database with 10 gigabytes of data up to 100 gigabytes or so. The larger you get, the more careful you want to be. In particular, where before you said CREATE INDEX, now you're going to say CREATE INDEX CONCURRENTLY. This cannot be run inside a transaction, so inside something like your Django migrations you're going to have to break it out. Sorry, but it's worth it. CREATE INDEX CONCURRENTLY does not hold a write lock while the index is being created; technically it does at the very end, but only for a few milliseconds. What this means is that instead of running CREATE INDEX and having everything grind to a halt while you add an index to a hundred-gigabyte table, it builds the index in the background, waits for it to be almost done, then takes the lock and cuts over to it. It's roughly two to three times slower, but it doesn't block writes, which is really key for larger databases.
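As a sketch of that three-step pattern plus a concurrent index build, on a hypothetical events table (names and values are illustrative; note that on Postgres 11 and later, adding a column with a constant default no longer rewrites the table, but the pattern remains the safe habit):

```sql
-- Step 1: add the column nullable with no default (fast), then set a default for new rows only.
ALTER TABLE events ADD COLUMN source text;
ALTER TABLE events ALTER COLUMN source SET DEFAULT 'unknown';

-- Step 2: backfill existing rows in small batches from a background job, e.g. run this repeatedly:
UPDATE events
SET source = 'unknown'
WHERE id IN (SELECT id FROM events WHERE source IS NULL LIMIT 1000);

-- Step 3: once the backfill is done, add the constraint (this scans the table but doesn't rewrite it).
ALTER TABLE events ALTER COLUMN source SET NOT NULL;

-- And for large tables, build new indexes without blocking writes.
-- CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_events_created_at ON events (created_at);
```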
Connection pooling. SQLAlchemy has connection pooling built in, and Django now has at least persistent connections. I would also encourage you to look at running a Postgres-side connection pooler. There are basically two sides to connection pooling: your application holding a bunch of open connections so a new request can just grab one, since there's latency in establishing an SSL connection to Postgres; and the fact that Postgres itself doesn't handle a large number of connections very well. At a thousand connections you have a lot of contention, it's consuming a lot of memory, that sort of thing. There are a few options here; PgBouncer and pgpool are the two big ones, and I highly, highly recommend PgBouncer. PgBouncer is a connection pooler that, as a new transaction comes in, hands it a connection and watches for the transaction to complete, so you don't have to worry about idle connections. There's a query you can run to see active versus idle connections against your database (there's a sketch of one at the end of this section). I've talked to a lot of people who say, "I have two thousand active connections from my application, I really need a high connection count." Then they run that query to see how many are active and how many are idle, and active is something like 10, which means there are 1,900-plus connections doing absolutely nothing, just consuming resources and adding no value. PgBouncer solves this for you.

Scaling cache: I mentioned early on watching that cache hit ratio. If you see it drop below 99%, go get a box with more memory; that's the easiest thing to do in terms of scaling. At some point you'll run into a brick wall: AWS only has instances so large. Okay, fine, you leave AWS and run on your own hardware, because you can get a really beefy box; then you start thinking about how long it takes to rack and stack that, and you're still going to hit a limit. Maybe you can get a box with a terabyte of memory, but if you keep growing, you're going to run into the limits of a single node. The first thing most people do is offload some read traffic: get a read replica, accept that things are slightly stale by a few milliseconds or seconds, which is okay for parts of the app. For replication there are a few options; I would generally recommend WAL-E, WAL-G, or Barman. WAL-E was authored by a colleague back when we were at Heroku, and we used it to handle replication for around two million Postgres databases, so I'm pretty sure it works at some decent scale. We recently rewrote it and released it as WAL-G, to be a bit faster, in Go (WAL-E is in Python). Give either a look; Barman is also a popular one. I'd recommend any of these if you're looking at setting up streaming replicas.

The other option is sharding. Sharding is the idea of splitting things up into smaller pieces. In an application, if there's one table that's crazily larger than all the others, it's usually called events, messages, or logs, and it may not belong in your database at all; you can probably move it off somewhere else. There are certain other data models that actually fit sharding really well. I do see things like one database per customer, or one schema per customer; please don't go down either of those paths. You end up with other problems: when you run a migration, you have to run it across a million databases. With 10 customers this isn't a problem; with a thousand it is. And at something like a thousand schemas, Postgres backups sometimes just don't work. That's okay, we didn't test them anyway, right? The other option is sharding within your application. Sharding is definitely the really-large-scale option; it has worked for Instagram, and it works for Facebook, Google, Salesforce. It used to be really, really hard, and it's gotten a little bit easier.
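Coming back to pooling for a second, the active-versus-idle check mentioned above can be done against the pg_stat_activity view; a minimal sketch:

```sql
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
```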
One option, and I'm not going to drill too deep into this: I work at Citus, and we turn Postgres into a sharded database that still looks like a single node to your application. You're still talking to what looks like one database; it's a pure extension, it's open source, so you can download it and just use it, and under the covers there are multiple physical instances. So you keep all the benefits of Postgres, your application still thinks it's single-node Postgres, and underneath it's sharded. For the general principles of sharding, the Instagram talk from a few years ago is a great one that walks through the best practices and that sort of thing; if you're thinking about it and have questions, I'm happy to field them afterwards.

Backups. When you start out, you're probably using pg_dump. That's the one where you get a SQL dump and upload it to S3 or keep it locally; it's human-readable, it's portable, it works. The other option is what's known as a physical backup: the actual bytes on disk. It's somewhat operating-system dependent; you can't take a Linux physical backup and restore it on Windows, so it's not necessarily great for going from your production environment down to your local laptop, because you're probably running a different OS. There are trade-offs between the two. The logical one, pg_dump, is really good across architectures, it's great for going from production down to a local environment, and it's good for portability, but it does put some load on the database: when you run it, you'll see a spike in load and performance impact. It works really well below about 50 gigabytes of data; when you get larger, up at a terabyte, I don't think I've seen a pg_dump work successfully. You can tweak it and maybe make it work, but it gets really hard, and at some point you just have to abandon pg_dump and go with physical backups. Physical backups put far less load on the database (there's some, but far less), they scale pretty much infinitely, and they're also what you'll use for things like read replicas. So over time you're going to want to shift from logical to physical backups.

All right, that was a lot, really fast, so a quick recap. Leverage your data types. I think DHH from the Rails community once said the database is just a dumb hash in the sky. I think very differently: the data is where a ton of the value is. If the data in your database is wrong, it doesn't matter that you put a pretty UI on it. There's a lot of value there, and you want to make sure you're getting the most out of your data, because it's one of your most precious things. So leverage your data types. Please test your backups; everyone who didn't raise their hand, go test them and see if they work. And spend some time mastering your tools: the data types, some of the extensions, and psql. There's a lot of value in having a database editor you like; if you don't like psql, go find another one, but psql is already there and it's really handy, so give it a try.
As you get a little bit larger, make sure Postgres is well tuned. You don't have to become a DBA; find a consultant if you need to, and look at the blog posts, tools, and talks that are out there. Really, don't invest time in becoming a perfect expert in all of the different configs; there are a lot of resources out there, so use them. Watch your cache hit ratio above anything else; it's the number one thing that affects performance. And then go ahead and index things. A lot of people, when they're small, don't index anything, and even at medium size they still don't have indexes. Go ahead and just start adding them and see what happens. If you spend a little more time saying, "I've got these slow queries, I can index this specific thing", that's definitely better, but the worst case is to just index a bunch of stuff, see what happens, and delete the unused ones later.

As you get to the large scale: move away from pg_dump, and set up connection pooling. How many people here run a large-scale database? Okay, a few hands. How many of you run a connection pooler like PgBouncer? Actually more hands than large databases, so that's really impressive; well done. PgBouncer is the best tool there. It's not perfect, but it's the best we have, and hopefully in time core Postgres gets better connection handling. And if you need to shard, invest in doing it the right way. Look at the Instagram talk, use a tool to help, and please don't go down the one-schema-per-customer path, because you'll lock yourself in and a couple of years later you'll have even more work than you'd have had to put in initially. So invest in doing it right once you go down that path. And try not to go down it at all; try to just scale up and up and up as long as you can, because not sharding is a lot easier than sharding.

All right, I think that's it. Any questions? We have a few minutes for questions; please line up here and ask away.

Hi, I have a question about indexes, because this got me interested again. It's a problem I haven't had to solve myself, but I've heard about it many times: a scheduling problem where you have, say, rooms and very flexible times, and of course you don't want to schedule two meetings in the same room at the same time. Would you use a GiST index to optimize that kind of query?

So, a couple of things I'd do there. First: are you using range types? Okay. With range types (I'm slightly blanking now, but we actually have a blog post on the Citus blog about constraints) what you can have is an exclusion constraint, so that things don't overlap. I forget whether it's related to GiST or not; I think it is. Search for exclusion constraints and range types. For calendaring and scheduling, range types, for people not familiar, are a type that has a from and a to. So if, like right now, you don't want two talks going on at the same time, or people double-booking a registration or a class, you can have an exclusion constraint so that things don't overlap. Thank you. Yep.
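For reference, an exclusion constraint over a range type is typically built on a GiST index; a hypothetical sketch for the room-scheduling case (btree_gist is needed so the plain equality part can live in the GiST index):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE room_bookings (
    room_id int NOT NULL,
    during  tstzrange NOT NULL,
    -- no two bookings for the same room may have overlapping time ranges
    EXCLUDE USING gist (room_id WITH =, during WITH &&)
);
```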
Thanks for the talk; I took some notes to pass on to my DBA. I have a question, more of an architectural one. There's been some debate in my team about running Postgres containerized, in Docker or Kubernetes or whatnot. What are your thoughts on that? It sounded a bit weird to me, but I don't really have experience with it.

It's a bit of a loaded question. There are people who run Postgres in containers; we do it for development and for testing, but not for production, personally. StatefulSets are getting better within Kubernetes, and it really does depend. There's a conference I chair out in San Francisco, Postgres Open, and we actually have some talks coming up in September from places that run Postgres within Kubernetes and Docker, with StatefulSets and that sort of thing. I'd say it's still a little bit early days, but people are doing it, so it varies and depends on your risk tolerance and expertise.

You didn't mention the built-in replication of Postgres. How do you feel about that?

Yeah, so the built-in replication still needs a little bit of help to set up, with something like Barman or WAL-E, because you need a base backup and then the write-ahead log to catch up. If you're using replication, I wouldn't recommend something like Slony or Londiste, which are replacements for the built-in mechanism; WAL-E and Barman are meant to supplement and help you use the built-in streaming replication. It has come a long way, and it still has a way to go. The base streaming replication right now is a binary format; logical replication came in fully baked in 10 (it sort of got there in 9.6, but not quite). I'm a fan of it, and it's getting better and better. If you have specific questions, I'm happy to drill in; find me afterwards.

At the beginning you mentioned using soft deletion, but with soft deletion you kind of lose the benefit of ON DELETE constraints. Do you have any good recommendation besides creating triggers for everything manually?

Not right off; that's probably the best one. Yeah, your best bet is creating triggers, or handling it in the application.
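A hypothetical sketch of what such a trigger could look like, cascading a soft delete from a users table to a dependent posts table (the tables, the deleted_at column, and the function name are all made up for illustration):

```sql
CREATE OR REPLACE FUNCTION cascade_soft_delete() RETURNS trigger AS $$
BEGIN
    -- only cascade when the row transitions from live to soft-deleted
    IF NEW.deleted_at IS NOT NULL AND OLD.deleted_at IS NULL THEN
        UPDATE posts SET deleted_at = NEW.deleted_at WHERE user_id = NEW.id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_soft_delete_cascade
    AFTER UPDATE OF deleted_at ON users
    FOR EACH ROW EXECUTE PROCEDURE cascade_soft_delete();
```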
Thank you for the talk. Do you have any performance advice for running a large-scale database on RDS?

Good question. Use Citus. No, seriously: it depends on what "large scale" means. Something like 700 gigs? Yeah. You'll be optimizing a lot of things, and you're getting to a scale where it depends on what size instance you're on. As you get larger and larger: we run Citus Cloud on top of AWS and see a lot of that scale, and we have customers up to around 900 terabytes (not quite petabytes).

Are there any tuning differences between running Postgres on an AWS instance yourself and moving to RDS?

You get some more flexibility and a little more control running it on AWS yourself than on RDS. I wouldn't say RDS can't scale to a terabyte; I've seen up to four terabytes on RDS, but I haven't seen really performant databases beyond that scale. At that point your options really do look like sharding of some kind, to split things into smaller pieces and be able to scale out. Thank you.

Okay, we're running a complex recommendation system that runs queries over very large tables, and we're having performance issues all the time. I guess the recommendations are basically hammering the indexes, and we have many indexes, so we have issues there. But also, don't you think it's better to have writes go to one instance and reads go to a different one?

Yes, but it depends. Having writes go to one side and reads on the other can actually create issues around replication lag; you can create latency and leave that replica unable to catch up. So in theory yes, but in practice I've sometimes seen it create more issues. Running queries against a read replica can create replication problems, or create issues on the primary as well, so it's not as if a read replica is perfectly safe to hammer; there are complexities there. Generally it can be one option. The other thing I'd say is that if you're running a recommendation engine, you're probably scanning a lot of data, so it's probably going to disk, and the ordering of data on disk is really important. I'd look into a couple of extensions, pg_repack or pg_reorg, which can actually reorganize data on disk. They're really interesting, powerful extensions: they do some really awesome things, and they can also do some really scary things. If you misuse them, they can make the entire database unusable, so proceed with caution. But if you are doing a lot of sequential scans, they could be interesting to help too.