If you're interested in this talk, if it doesn't suck and you want to follow up on it later, I'm going to be tweeting out the slide deck and all that information, so follow me on Twitter. So let's get started.

Like many stories, this story starts in the year 2000. I was a recent college grad with a degree in chemistry, so I decided the best thing to do with my career was to pursue old-timey sign painting. Somebody actually made this by hand; it's pretty cool. I wanted to be able to do that. So I did what any brash 21-year-old would do: I went out and bought some badger-tail brushes, bought a couple cans of One Shot, and started doing cold calls on local businesses, just walking in and accosting whatever random person happened to be there. And yeah, I didn't make any sales, which turned out to be a blessing in disguise, because I'm a horrible salesman, but I'm an even worse artist. Little did I know that I had taken one step on the road towards becoming a virtuous programmer.

I'm telling you this story because I want to get the worst out of the way first. It's kind of an embarrassing, stupid story, right? But it makes sense here, because we all do this. We all go in and promise things that we're not quite sure how to deliver, and then we go and figure out how to deliver them, and I think that's pretty awesome. I think that's a great thing about our community as software developers. And if you're a freelancer, this is a skill you must possess, because clients want to talk to you, and you have to say things to them. You say things like, "Sure, it'll scale," about, I don't know, their Brony social network. But what you really mean is, "Yeah, sure, it'll scale just long enough for your three months of runway to run out and for you to go out of business." No, I'm not that cold.

But occasionally something does scale. Occasionally you need to deal with what I call biggish data. Biggish data is not big data. It's not big architecture. It's not networking together, you know, 40 EC2 instances into some fancy topology. What I'm talking about is: how do you keep a normal, pretty vanilla Rails app running as your production data set grows? And even if you don't have a fire hose of data, you can easily find yourself in this situation.

Whenever I talk to people about this, they always want to know: Starr, you're being wishy-washy, define biggish data. What does that mean in terms of numbers? So I made up some numbers. I'm talking about something like a terabyte of data in your database, a couple of gigabytes coming in per day, served by a normal Rails app. And the good news is that Postgres can handle this really well. This is the sort of thing Postgres just tears up. But chances are your application cannot. The reason is that a lot of the conventions we use when developing web applications, especially with something like ActiveRecord, work well for small and mid-sized web apps, apps that don't have gigabytes or terabytes of data. But as soon as you start using those conventions with larger data sets, you find that, well, they just don't work that well. This is actually a live count query that's been running since this morning; I wanted to show you guys how many errors we had in our database. Similarly, if you want to page through your data, you install some pagination gem and it doesn't really work either.
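To make that concrete, here's the shape of the queries involved. This is just a sketch; the `errors` table name is a stand-in for whatever your biggest table happens to be:

```sql
-- A naive count: Postgres walks every visible row in the table, so on a
-- billion-row table this can run for hours.
SELECT COUNT(*) FROM errors;

-- A typical pagination-gem query for page 5,001 at 100 per page: Postgres
-- still has to scan and throw away the first 500,000 rows to return 100.
SELECT * FROM errors ORDER BY id LIMIT 100 OFFSET 500000;
```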
And so finally, you're just like, screw this biggish data BS. This isn't worth the hassle, so I'm just going to delete it all. And that doesn't even work. If we ran this, it would probably take three days to delete that much data.

To understand why this is the case, how you can work around it, and how you can configure your database so that it performs well under these conditions, we need to go all the way back to the summer of 1978. The summer of disco, free love, and the VT100 terminal. The VT100 was the first computer-like thing you could buy and put on your desk to prove to the rest of the office that you were, in fact, the alpha nerd. 1978 was also the year that Oracle 1.0 was written, in PDP-11 assembly language. And just a little aside: this was the most interesting image I could find that had anything to do with Oracle. I looked for an hour. If you've never seen a PDP-11, this is what it looks like.

Nowadays marketers use buzzwords like "platform as a service," and we're doing everything in the cloud. Back then they had marketers too, only they were talking about real-time computing. For them, real-time meant that the report you asked for runs in five minutes instead of a day. And that's all great, I'm sure it was great at the time, but it doesn't seem like a very good platform to build the web on. If you think about it, though, that's kind of what we've done. All the technologies we use today for web development are either direct descendants or kind of bastard nephews of this stack.

So how did that happen? Well, this real-time computing thing took queries from a day down to five minutes, down to five seconds, down to 50 milliseconds, and all of a sudden you can use it as the foundation for a real-time web app. And all of this happened because of Moore's law. I'll just remind you of the formal definition of Moore's law, in case you don't know it: as time goes forward, computers get more awesome. They can crunch more data. They can do queries faster. And so I want to postulate Starr's corollary to Moore's law, which is this: as your database grows, you actually travel back in time.

This is the situation I found myself in about two years ago at my company. I'm with Honeybadger; we do exception monitoring for Ruby apps. And there are a lot of freaking exceptions out there. I mean, you guys need to do your jobs better. But the thing is, we didn't expect this to succeed. It was just a side project. We built it like a normal Rails app. Then people started using it, and people started sending us thousands of requests a day, and then thousands of requests a minute, and then a second. And we essentially found that we had to go back and start worrying about things that web developers were worried about five, ten, fifteen years ago. I guess there weren't too many web developers fifteen years ago. Or maybe there were; I'm old.

So the first thing we did, and the first thing I suggest you do if you have any sort of database performance issue at all: just use a real damn computer, please. If you're on anything called Rackspace or EC2 or Linode and your database is slow, put it on a real computer. And it's just awesome, right? You'll probably get an order-of-magnitude performance improvement. And if possible, put it on a computer with a ton of hard drives: put your operating system on a separate drive, put your logs on a separate drive, and put your actual database data on a RAID array.
And that will give you awesome performance. The reason you want to do this is that most VPSs have horrible disk I/O, and disk I/O is what databases eat; it's what they need to be fast. So that's your first and easiest big win: just throw some money at the problem and buy a computer.

The second thing we need to do is understand our queries better. And I can put this on a slide because I'm from Arkansas, so it's not offensive; I haven't broken out my good-old-boy accent on y'all yet. Everybody knows what EXPLAIN is, right? You feed EXPLAIN your query, and it gives you back a query plan. We're only going to worry about one aspect of this gobbledygook: we're just going to look at rows. The rows number is how many rows Postgres expects to have to chew through to give you the answer you want. And about rows, there's only one rule: more rows, more problems.

Now let's go back to that original count query that took forever. Why did it take forever? Well, it turns out that when you ask Postgres to count the number of rows in a table, it actually counts them one by one by one. It seems like there should be a better way to do that, but no, that's what you get.

When you do a pagination query, you wind up with a similar problem. If you happen to have a really high offset, an offset of, say, 500,000, Postgres counts up to 500,000 and then counts up 100 more. Now, the best solution to this sort of problem is to make it so your users simply can't page through hundreds of thousands of records; that really would be the best solution. But if that doesn't work, there's a little hack you can use, which is a range query, which can use the index and is actually pretty fast. Here's a similar thing you might do (it's not the exact same query), and you can see it's only touching 100 rows instead of 500,100.

Sorting: I'm just going to wave my hands around sorting, because it's a little bit complicated. If you had asked me a couple of years ago, man, I was so naive about sorting. I thought I just had to put an index in my migration file and everything would be good; as long as that migration was in there, it'd be fast. But no, it doesn't work that way. Sometimes, if the stars align, Postgres might be able to use the index for sorting. But most of the time it's actually going to be faster to load all that data into memory and do a quicksort on it, which is kind of crazy when you think about it. So be careful with sorting.

So the two big wins I want you to take away are: use a real damn computer for your database, and use EXPLAIN. Really get in there and understand what your queries are doing. And at this point, I think you guys deserve a cookie. So those are the big wins, right? And nothing could ever go wrong with this, I'm sure. Yeah, nothing could ever go wrong. I know that doesn't have anything to do with the talk; I just liked it so much I had to put it up there.
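Before moving on, here's a minimal sketch of those habits in practice, again assuming a hypothetical `errors` table with an index on `id`; run these under EXPLAIN and watch how the rows estimates change:

```sql
-- Offset pagination: Postgres has to walk past the first 500,000 rows
-- before it can hand back the 100 you asked for.
EXPLAIN SELECT * FROM errors ORDER BY id LIMIT 100 OFFSET 500000;

-- Range-query version: remember the last id you showed the user and start
-- from there. With an index on id, Postgres jumps straight to that spot
-- and only touches about 100 rows.
EXPLAIN SELECT * FROM errors WHERE id > 500000 ORDER BY id LIMIT 100;

-- For sorting, an index only helps when the query's ORDER BY lines up with
-- it; even then, Postgres may decide an in-memory sort is cheaper.
CREATE INDEX errors_account_created_idx ON errors (account_id, created_at DESC);
```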
Okay, so now I'm going to run through some of the common problems you find when you have these sorts of big data sets. First: slow cache misses. As you may know, your database has its own internal cache. If you keep accessing the same data all the time, it's going to load that data into its internal cache, and it's going to be really fast to access, which is great until a user who hasn't logged in for a couple of months logs into your site, and it takes them five seconds to load a page that takes your frequent users a second or whatever.

The main solution to this is to get a lot of RAM. But then you also have to configure Postgres to actually use that RAM. Out of the box, the Postgres configuration is meant for really the lowest common denominator; it's meant for those 64-megabytes-of-RAM Linux boxes you used to be able to buy ten years ago. A simple solution is to use a tool like PgTune, which inspects your server and spits out a decent configuration file you can use as a jumping-off point.

Also, whenever you have a cache miss, the database has to go to the disk and pull that information off of it, so disk I/O is really important here. The easiest way to get high disk I/O is to get one of those high-performance RAID arrays and spend a hundred grand on a server. But if you want a much, much smaller performance boost that's free, there's an interesting little configuration parameter on most Linux systems, I think all Linux systems, called the readahead cache. When Linux detects that you're reading a bunch of blocks in sequence off the disk, it says, hey, this dude's reading a bunch of blocks in sequence off the disk; I'm going to go ahead and load the next ones into memory so I can serve them from RAM instead of from the disk. Which is a great idea, but Linux normally ships with this number set super low, like 32K or something, a ridiculous number. You can increase it to two megs, four megs, whatever your system can handle, and a lot of people say they get a big performance boost from this, especially on reads.

The next problem is slow vacuums. If you have a database server that's under a ton of load, you may log in, run ps, and see all these Postgres processes with VACUUM next to them. And you think to yourself, well, those aren't my users' queries; that's not actually doing anything for me. So a common mistake people make is to decrease the amount of vacuuming Postgres does, because you think, okay, this is a maintenance thing, maybe I can turn off autovacuum and do it once a night at midnight, or only do it once an hour. The problem is that vacuuming is essentially garbage collection for Postgres. If you turn it off, Postgres consumes all of these resources and never frees them back into its own pool so it can use them again, and your performance just goes down. So actually, if you find yourself with a vacuuming problem, you need to vacuum more. And that's true for both your house and Postgres.
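Here's a rough sketch of the kind of tuning this section describes. The numbers are purely illustrative for a box with around 32 GB of RAM, not recommendations; PgTune will generate values for your actual hardware, and `ALTER SYSTEM` assumes Postgres 9.4 or newer (on older versions you'd edit postgresql.conf directly). The readahead setting itself lives at the operating-system level rather than inside Postgres, so it isn't shown here.

```sql
-- Illustrative values only; run PgTune against your own hardware.
ALTER SYSTEM SET shared_buffers = '8GB';         -- Postgres' own cache; needs a restart to apply
ALTER SYSTEM SET effective_cache_size = '24GB';  -- how much data the OS is likely to cache for you
ALTER SYSTEM SET work_mem = '64MB';              -- memory each individual sort or hash may use

-- Vacuum more, not less: let autovacuum kick in earlier and more often.
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;  -- default is 0.2
ALTER SYSTEM SET autovacuum_naptime = '30s';

SELECT pg_reload_conf();  -- picks up everything here except shared_buffers
```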
Also, when you're looking at your processes, you may notice that you have a ton of database connections, or rather a ton of Postgres processes. The reason is that for each database connection you make, Postgres spawns a new process.

That's not something you typically have to think about for a normal small-to-medium web application, because ActiveRecord, for example, has connection pooling that takes care of this in the background and makes sure you don't have to manage those connections manually. However, if you wind up running a lot of jobs in job queues, or if you have a very distributed architecture where you have, say, 20 different servers running Unicorn for whatever reason, or 20 different servers running different jobs, and they all connect to the database, you can rack up a thousand database connections; it's not that difficult. And that's bad, because each of those connections is a process, and that takes overhead, and that slows down your database in general. So, as mentioned, you can either use an application-layer connection pool, or you can use a tool like PgBouncer, which is a proxy that sits between Postgres and your app. PgBouncer maintains, say, 50 connections to the database, and then your app and all of its components may have a thousand connections to PgBouncer.

So that's one thing. And if you have that many incoming connections and that many jobs happening, you probably have a lot of inserts and updates and all that stuff hitting your database, which gives you locking problems, or at least it can. One common example I've actually run into: in ActiveRecord it's really normal to have a parent model and a child model, where the parent has many children, and every time you add or delete a child, it updates a counter cache on the parent model, right? That all works pretty well until somebody adds, like, 10,000 children to the same parent within the course of a second or two. What happens there is that the first update that gets through goes in and locks that row in the database so that nothing else can modify it, updates the cache, releases the lock, and it's good to go. Then the next query comes through. And that works for maybe a couple hundred queries, but by the time a couple hundred have gone through, that 10,000th query has already timed out: either your application server has timed out, or your workers have timed out, or your connections have timed out.
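To make that pile-up concrete, here's roughly the SQL each of those child creations ends up issuing. This is a sketch with hypothetical `parents` and `children` tables, not the exact statements ActiveRecord generates:

```sql
BEGIN;
INSERT INTO children (parent_id, created_at) VALUES (123, now());
-- The counter cache update takes a row-level lock on parents row 123 and
-- holds it until the transaction commits, so 10,000 of these arriving at
-- once have to execute strictly one at a time.
UPDATE parents SET children_count = children_count + 1 WHERE id = 123;
COMMIT;
```

Since each transaction holds that row lock until it commits, the updates serialize, and the ones at the back of the line are the ones that blow past your timeouts.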
Yeah, I meant to hide this slide, but imagine you have, well, it's really common to have the situation where you have all this data in your database and you want to run complicated reports on it. The problem is that those reports are super-intensive database queries, and they can adversely affect your application performance; you don't want them touching your real-time application. In that case it can be really useful to have a read-only replica that you use only for running those reports, and then you cache the values of those reports.

Now finally, well, not quite finally, there's one more slide. Unless you wind up with some sort of heavily sharded architecture or distributed system, you eventually need to start taking data out of your database, right? You need to either delete it or move it into some sort of archival system. But as we've already seen, those delete queries are murder; they take forever. For this, Postgres has a really nice set of features around partitioning.

If you haven't seen it, partitioning essentially lets you define a parent table and multiple child tables that all have the same schema. Your Rails application only talks to the parent table; it only knows about the parent table. But when you insert a record, Postgres looks at that record and, based on rules you've specified, stores it in one of the child tables. Then when you query the parent table, it pulls the data back out of the child tables. This is used really commonly, for example, when you want to expire things by date. I can say that all the errors that come in today go into a table that's specific to today, and the same for yesterday and the day before. Then when I want to delete errors that are 180 days old, I just go back and drop the child tables that are 180 days old. And you can use this for a number of really interesting things; you can partition by account ID, all sorts of cool stuff.

And finally, really finally: backups. When you're dealing with a big data set like this, you really need to investigate a streaming backup system, because the whole dump-the-entire-database-at-midnight-and-upload-it-to-S3 thing just doesn't work anymore. Fortunately, Postgres uses these things called WAL files, write-ahead log files, partly to prevent disasters and partly to aid in replication. And they also let you do a nice streaming backup, where a fresh backup of your database is shipped to S3 every, say, 20 minutes.

Okay, so I've just been talking at a million miles a minute, and that's a lot of stuff to take in. But I just want to point out that you guys can totally do this. Me and my team, we're normal Rails developers. We built this thing, it had to scale, and we sort of figured out how to make it work. You guys can do that too. So yeah, this is my favorite guy. This is a Mexican soccer coach from the World Cup; the best thing about the World Cup this year. Anyway, if you want to see this slide deck and some of the links that were in there, I'll be tweeting that out. If you want to know about your errors, use Honeybadger. And I'm done, so.
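For reference, here's a minimal sketch of the date-based partitioning scheme described above, in the table-inheritance style Postgres offered before declarative partitioning existed. The table names, the single hard-coded day, and the trigger function are all placeholders; in practice a cron job or a gem creates the child tables and keeps the routing logic up to date:

```sql
-- Parent table: the only table the Rails app knows about.
CREATE TABLE errors (
  id         bigserial,
  message    text,
  created_at timestamptz NOT NULL
);

-- One child table per day, holding only that day's rows.
CREATE TABLE errors_2014_06_01 (
  CHECK (created_at >= '2014-06-01' AND created_at < '2014-06-02')
) INHERITS (errors);

-- Route inserts against the parent into the right child.
CREATE OR REPLACE FUNCTION errors_insert_router() RETURNS trigger AS $$
BEGIN
  IF NEW.created_at >= '2014-06-01' AND NEW.created_at < '2014-06-02' THEN
    INSERT INTO errors_2014_06_01 VALUES (NEW.*);
  END IF;
  RETURN NULL;  -- the row has already been stored in a child table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER errors_insert BEFORE INSERT ON errors
  FOR EACH ROW EXECUTE PROCEDURE errors_insert_router();

-- Expiring old data becomes a cheap metadata operation instead of a huge DELETE:
DROP TABLE errors_2013_12_03;  -- a child table from roughly 180 days ago
```

On the backup side, the WAL-based streaming approach is usually set up with an archiving tool such as WAL-E shipping WAL segments to S3, rather than wired up by hand.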