Hi. OK. My talk is called — oh shit — Lessons in Continually Breaking a Huge Website. Can everyone hear me? Good. OK, that's good. My name is Tim Morgan. I'm a senior Rails developer at Scribd. I was hired at Scribd as kind of a Rails guru, because I knew the API really well, but I didn't have a lot of experience in scaling. Scribd was a trial by fire for me, and I learned a lot of things the hard way. So this talk isn't going to be about security vulnerabilities or anything like that. Instead, I'm going to talk about what I've learned as a Rails developer at Scribd. I like framing it this way — how I broke things — because I think engineers learn the most when we break things, when we're knee deep in the debugger. That's when we really start to understand Rails and the stuff we're working with.

So, a bit about Scribd. Scribd is a really, really large Ruby on Rails website. Here's our Quantcast page. We do about 40 million users per month, we serve 36 megabytes of traffic per second, and every day 50,000 new documents are uploaded to Amazon S3. My point isn't to dazzle you with statistics; it's to let you know that when you have a website this big, it becomes really, really easy to break it if you don't understand exactly what you're doing and exactly what Rails is doing. So I'll be taking you through some of the more infamous mistakes I've made. These mistakes are almost always problems of scale. (Is this still on? OK.) They're almost always problems of scale — stuff you wouldn't think twice about writing if you had 3,000 users per day on a hobby site, but once your traffic explodes, it can really come back to bite you. And almost all of them center around how Rails uses the database. Scribd uses MySQL, so there's going to be some MySQL-specific stuff in here, but I'll try to generalize it a bit, and I'll talk about Postgres and Oracle where I can. Bear with me if you don't use MySQL.

OK, so my most devastating mistake ever at Scribd, which I'll talk about in a bit, resulted in our DBA giving me a few good pointers — actually, it resulted in him yelling a few good pointers. Here's his log from Campfire. In short, what he was telling me to do is to understand what the code is doing. That's crucially important, because Rails cannot be relied upon to generate the best query for every situation, and in fact sometimes it can't even be relied upon to generate a good query for the situation. I also hope everyone is familiar with SQL's EXPLAIN and EXPLAIN ANALYZE tools. If not: you just stick EXPLAIN in front of a query. It doesn't run the query — so if the query's slow, you don't have to worry about that — it just gives you information on how the database would run it, and that's very, very useful. You'll see some examples of that in the upcoming slides. So I'll boil it down to a few points. Every time you add a feature, you should run the feature on your site and look at your query log — unless it's the absolute most trivial of features and you know it doesn't touch the database. You should understand what each query does and know it's not a problem, or, if you suspect it might be a problem, you should have MySQL explain that query.
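To make that concrete, here's a minimal sketch of what I mean (mine, not from the slides): grab the SQL out of your log and feed it to EXPLAIN, either in the mysql console or right from script/console. The table and column names here are just the users/login example from later in the talk.

    # In script/console, assuming a users table with a login column.
    sql = "SELECT * FROM users WHERE login = 'tim'"
    ActiveRecord::Base.connection.select_all("EXPLAIN #{sql}").each do |row|
      puts row.inspect   # look at the key, rows, and Extra columns
    end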
If you test with a large database, you'll get frustrated at how slow some parts of your code are — and, well, guess what? Your users are frustrated too. This is actually a little hypocritical of me, because I don't always do this, but when I have done it I've found it really useful. I know that with Rails 3.0 we're going to be a little more agnostic about the ORM, and maybe some of you have used DataMapper. DataMapper has a really cool tool called dm-sweatshop, which gives your factories double duty: it lets you erect a fully populated, production-sized database from your factories, with arbitrary numbers of rows. So in a small amount of time you can have a database bigger than you'd expect to have in production, and then when you play with your site you might notice some things are cripplingly slow, and you'll be just as frustrated as your users would be.

And lastly, you should pay close attention to your indexes. Hopefully everyone here understands the basics: if you have a users table with a login column, and you need to look up users by login, you need an index on the login column. Things get more complicated with more complicated queries. Indexes are very database-specific — MySQL and Postgres use very different strategies for using indexes in queries — so it's important to understand how your database uses indexes and what indexes you'll need for different queries.

So we'll start with the anecdotes. The first one is a subject that's been a little beaten to death already, but I've added what's hopefully a novel twist: the problem with find_in_batches. What we're talking about here is when you need to update every row, or a lot of rows, in your table, but it's not the kind of thing you can do with a SQL UPDATE statement. You need to instantiate every row as an ActiveRecord object and perform some operation on it — like downloading from S3 and uploading to a different bucket — something you obviously can't do in SQL. Hopefully everyone sees the problem with the naive approach: if you do User.all, you're going to instantiate every single record in your database all at once. Even if you can fit that in memory, if you have a lot of users it's still going to be very slow, and of course it executes a very slow query, SELECT * FROM users.

So in Rails 2.3 we get the find_each method, which calls find_in_batches, and it solves this problem. Hopefully most of you know how it works: it grabs the first n rows from the table, looks at the ID of the last item in that batch, and uses that to find the start of the next n rows. We don't use OFFSET, particularly in MySQL, because OFFSET is slow — it does a linear iteration over the table to find the nth item, which is O(n), whereas if you use the primary key, which you presumably have an index on, it does a B-tree lookup, which is O(log n).

The problem is with composite primary keys. If you have a table like this — a group_memberships table that joins the users and groups tables — it doesn't need a surrogate key from the database; the user ID and group ID are enough to uniquely identify a row. So you use a composite primary key. Rails doesn't handle this very well on its own, but there's the composite_primary_keys plugin. And the find_in_batches method, which I've copied and pasted here, will break if you use composite primary keys. Not realizing this, I foolishly just used find_in_batches, pushed to production, and Scribd goes down.
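To make the batching strategy concrete, here's a rough, hand-rolled version of the idea for a single integer primary key — a sketch of what find_each does under the hood, not the actual Rails implementation. The do_expensive_thing call is a stand-in for whatever per-row work you need.

    # Walk the table in batches of 1,000, keyed off the primary key instead of OFFSET.
    last_id = 0
    loop do
      batch = User.find(:all,
                        :conditions => ["id > ?", last_id],
                        :order      => "id ASC",
                        :limit      => 1000)
      break if batch.empty?
      batch.each { |user| user.do_expensive_thing }  # e.g. copy a file between S3 buckets
      last_id = batch.last.id
    end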
So there are a couple of bugs in Rails' own find_in_batches that I discovered while I was fixing this. You can see right here that the table names are not escaped, so if you have a table name that's also a reserved keyword, it's going to break. And the records are yielded within the scope up there — that with_scope, and then it yields the records — so if you have your own scope, it's just going to be overwritten.

Anyway, the composite_primary_keys plugin: a lot of you are probably familiar with it. You do something like this, where you pass a list of primary keys, and then when you call YourModel.primary_key you get an array back instead of a string. The solution in this case is to build a query generator that loads batches of objects the same way you count multi-digit numbers. If you have three primary keys, it works just like counting three-digit numbers. I'll go through it briefly, because it's not worth getting into all the details. You grab the first five rows from the table and note the primary key combination of the last row. Then you grab the next five rows where, say, the user ID equals 5 and the group ID is greater than 1, or the user ID is greater than 5 — limit five. Then you note the primary key combination of that batch's last item, and so on and so forth. Hopefully you get the idea. The full solution is available at this URL — it's just a gist, a monkey patch — so if you use composite primary keys and you need find_in_batches, you can use it too. It monkey patches the default find_in_batches implementation; obviously there's some contention about monkey patching, so if you don't like that, you can just rename the method and it won't overwrite the existing one. It also fixes the two bugs I mentioned earlier: if you have your own scope, this method will not overwrite it, and you can use tables whose names are reserved SQL keywords. Standard disclaimer: there might be bugs. We use it at Scribd, but it wouldn't be the last time I pushed buggy code at Scribd, so you should test it before you push it.

The next one is my most infamous moment: the problem with validates_uniqueness_of. Some of you are laughing — some of you have seen this before. Yeah, it's really bad. It stands as an absolutely shining example of why you should watch your query log, why you should explain queries, and how even the simplest of changes needs to be investigated at the database level. Here's a very abbreviated example of Scribd's user model — pretty standard stuff. Then a requirement comes down the pipeline: we need case insensitivity in logins, so that if I sign up with one login and you sign up with the same login in a different case combination, you're prevented from signing up. We all know how to fix this, right? It's pretty easy: case_sensitive is false. So I get this ticket on FogBugz — it's like 10 p.m. — and I'm like, OK, I'll just quickly do this before I go to bed. I add it, push it out, go home. Yeah — exactly, I agree with him. And I come to work the next morning and discover Scribd is down. Everyone on Campfire is mad at me, because they git bisected it down to my commit. Yeah — git bisect does not play favorites. So how could this have happened? Let's take a look at the SQL that Rails runs when a user model is validated.
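It comes out roughly like this — paraphrased from memory, so treat the exact form as approximate:

    SELECT `users`.id FROM `users`
    WHERE (LOWER(`users`.`login`) = 'newuser') LIMIT 1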
So it does this, and the offending clause is the LOWER(login) portion. MySQL has to calculate the lowercase version of every single user's login, one by one, and with the more than 10 million users we had at the time, that query was taking two minutes to run on our database. And it was being run every time anybody tried to sign up for Scribd, which is multiple times per minute. So soon enough, Scribd goes down. I'd also managed to seriously anger my DBA — this guy. He works remotely, so I'm safe for now. There was one positive bonus from the whole thing, however: I learned what my name is in Russian.

So what's the solution? Well, if you're fortunate enough to use Postgres or Oracle, you can just use a functional index: you create an index on LOWER(login). It'll take some time to create the index, but then your problem is essentially solved. MySQL, however, doesn't support functional indexes, so the solution in MySQL is a little trickier, and it's something you might not expect if you're not that familiar with MySQL. As you see here, MySQL is case-insensitive for string comparisons by default. That's actually an even bigger problem, because now validates_uniqueness_of and MySQL are kind of diametrically opposed: MySQL will make the comparison case-insensitive regardless of what you pass to validates_uniqueness_of. So basically, wherever there's a validates_uniqueness_of, it's implicitly case-insensitive.

The solution here is the BINARY keyword in MySQL. This keyword, when placed before a string, tells MySQL to treat the comparison as binary data instead of as a collated string, and it incurs no speed hit, at least as far as I know. So basically you need to modify validates_uniqueness_of to include this BINARY keyword when case_sensitive is true, and leave it out when case_sensitive is false. Where you put it is also very important — this is actually the second of the two times I broke Scribd that day, because I committed it with BINARY next to the column. Despite the fact that it doesn't modify any data, when you stick it next to login it's O(n) again, because MySQL applies that binary conversion to every user's login, which is essentially the same problem we had before. So it's important to put it on the right side of the equals sign — or at least on the side of the constant string, whichever side that's on.

Again, the solution is online; it's another monkey patch to validates_uniqueness_of. It's really just there to get the code out so people can see it. The monkey patch also doesn't check whether you're using MySQL, and BINARY is not a keyword in other databases' SQL dialects, so you should probably only include it in a MySQL project — or, if anybody wants to make a plugin out of it that's smart, checks whether you're on MySQL, and only loads itself then, that's fine too. Also, standard disclaimer: it may contain bugs.
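To put the placement point in concrete terms, here's a rough sketch of the two forms, assuming MySQL; the performance characterizations in the comments are the ones from the discussion above, not benchmarks.

    -- Fast (per the discussion above): BINARY on the constant side, so the
    -- index on login can still narrow the match down.
    SELECT `users`.id FROM `users` WHERE (`users`.`login` = BINARY 'TimM') LIMIT 1

    -- Slow: BINARY applied to the column, so it's computed for every row,
    -- which is the same O(n) problem as LOWER(login).
    SELECT `users`.id FROM `users` WHERE (BINARY `users`.`login` = 'TimM') LIMIT 1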
So the problem with delete and destroy is really more a problem with Rails' documentation than anything else. In general, Rails documentation is pretty explicit about how different methods work. One area where it falls short is delete and destroy when you use them with has_many and has_and_belongs_to_many associations: their behavior changes depending on how you've configured the association, which can seem like unrelated code. It's really an issue with how the documentation is set up, or just with the architecture decisions that were made. I'll take you through a real-world example that, of course, happened at Scribd.

So here we have a Category model. Categories group together documents with similar subjects, and they're joined through a CategoryMembership join model: Category has_many category_memberships and has_many documents through category_memberships. We also have a dependent delete_all setting. As some of you probably know, dependent delete_all is faster than dependent destroy, because destroy runs the before_destroy and after_destroy hooks, so it has to instantiate each row as an ActiveRecord object to run them. If you don't have any hooks, you just use delete_all and it does a single SQL DELETE.

Anyway, down the pipeline comes a requirement: when we remove a category, we want to tag its documents with the name of that category. So if we have five documents in the aviation category, we want to tag them with "aviation" before we delete the category, so we can keep track of where they were, or at least have a relevant tag. Everyone knows how to fix this: you just add a before_destroy that tags those documents, and you can all imagine what the implementation of that method might look like.

Well, it didn't work. Here was my implementation. We're using somewhat fancy SQL syntax so we don't have to iterate and instantiate each row as an ActiveRecord object, but what's important is the subquery I've highlighted in red, which just selects the IDs of all the documents in the category. When this query is executed, you might be surprised to discover that it returns no rows. In other words, by the time we get to that before_destroy, all the category memberships have already been deleted. That doesn't make sense, right? Why would Rails remove associated objects before calling the destroy method? Why is it running that dependent delete_all before the before_destroy hook?

The fix is not obvious, and it's probably really frustrating to a lot of people: you just have to move your before_destroy to the top. That's some Rails magic, so let me explain what the magic is. has_many is actually doing a little bit of metaprogramming: when you call has_many with a dependent option, it essentially invokes before_destroy on the receiver itself and registers its own hook that does the deleting. And before_destroy hooks run in the order they're defined, so if that has_many comes before your before_destroy hook, its hook runs first. Which is infuriating, but it's the way it is.

But let's say you, like me, had no conceptual idea about this, and did not in your wildest dreams believe that moving a line to the top would fix the problem. So you made a different fix: you removed that dependent delete_all and said, OK, I'll delete the damn things myself. You remove the dependent delete_all, and instead you add a delete_all call at the very bottom of your method — I'm going to run my tagging first, and then I will delete these myself, thank you very much. Well, now you may be surprised to learn — hopefully before you push it to production — that your delete_all call does not delete the category memberships anymore. It merely sets their category IDs to null.
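A minimal sketch of that surprise, using the Category model described above; the method name tag_documents_with_category_name is my stand-in for whatever the real callback was called, and the SQL in the comment is roughly what you'd see in the query log.

    class Category < ActiveRecord::Base
      has_many :category_memberships          # :dependent => :delete_all removed
      has_many :documents, :through => :category_memberships
      before_destroy :tag_documents_with_category_name

      def tag_documents_with_category_name
        # ... tag each document with self.name here ...
        category_memberships.delete_all
        # Logs roughly: UPDATE category_memberships SET category_id = NULL WHERE category_id = 42
        # The rows are still in the table; they've only been "removed from the association".
      end
    end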
Why? It's tricky verbiage in the documentation. The association's delete_all method "removes all records from the association." It doesn't say anything about deleting the records; it merely says they're removed from the association. And so Rails, in its computer mind, believes that by setting their category IDs to null, it has effectively removed them from the association. (Screw you. OK, Keynote.) Rails has no idea that a category membership with a null category ID is essentially dead weight in your database.

You were probably looking for something like this: the ActiveRecord::Base delete_all method, which will actually do the SQL DELETE and do what you wanted. So this is simple, right? association.delete_all doesn't delete the records, it merely sets their foreign keys to null — you just have to remember that. Well, not quite. Apparently you actually need a table to figure out what delete_all does, because it depends on what your dependent option is. And it makes a little sense if you read through it: if your dependent option is nullify, which is the default, it sets the records' foreign keys to null. If it's destroy, it loads each record, runs the before_destroy hooks, deletes the record, and runs the after_destroy hooks. If it's delete_all, it deletes the records from the database. It makes some sense, in some way; it's just a little quirky.

But there is something here that should catch your eye — it caught mine. Two things, actually: what's up with the SELECTs? Why do we need to do a SELECT before an UPDATE? Why do we need a SELECT before a DELETE? Well, let's be smart, investigate the query log, and see what's going on. It looks like it first does the SELECT we expect, with the WHERE clause we expect, to get a list of IDs, and then it does an UPDATE, passing in that list of IDs. That's fine for the nine category memberships in this example, but MySQL in particular doesn't do very well with long lists of IDs. Even if you have an index on id, this is still going to break if the category you're deleting is particularly large.

There's an even bigger problem, though, and the observant among you may have noticed it: we're using id, and category_memberships is a join table — it shouldn't even have an id column. Our older join tables at Scribd often do, just because at the time we didn't know about the :id => false option; now we do, and so newer ones don't. And the truth is, if you don't have an id, you actually get this, which is useless. A recent version of composite_primary_keys has fixed that, which is good, but it doesn't fix the larger problem, which is that we're still passing a huge list — in the composite-primary-key case a bunch of ANDs ORed together, otherwise an IN with a big list of items.

So we're back to our refrain: what's the solution? When in doubt, fall back on methods that are well defined. No matter what parameters you give it, the class-level ActiveRecord::Base delete_all method will perform one and exactly one DELETE, and nothing else. Don't be afraid to give up some of that Rails magic in exchange for knowing what's going on in your database.
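In code, that well-defined fallback looks something like this (a sketch, continuing the hypothetical callback from the earlier example; note the conditions go straight in as an array, not wrapped in a :conditions hash):

    # Inside Category#tag_documents_with_category_name, after tagging the documents:
    CategoryMembership.delete_all(["category_id = ?", id])
    # Always exactly one query, roughly:
    #   DELETE FROM category_memberships WHERE category_id = 42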
By the way, a little aside: notice there's no :conditions hash there. If you put a :conditions hash in there, Rails will say, "conditions — well, that's not a column in the database, so I'll just delete everything." That's how we lost our credit cards table. This is a true story. So, yeah — fine, back up. Well, actually, at the time we had just launched our store, so there were only a few credit cards, and we were just like, eh. Eh. Sorry, you guys. And the sad thing is — I don't know if any of you have worked with payment systems, but you can't store credit cards in your own database; you have to store them in a vault. So really they were just the IDs of credit cards stored by some big enterprise vault company. So when those users come back and go, "wait, no, I did type in my credit card," they type it in again, we send it off to the vault, and the vault says, "we already have this credit card." So they get told they already have it, but on our side they don't have it. It's actually pretty bad. Fortunately, there are only a few of them who will, you know, never be able to use their credit card again. Anyway: delete_all takes its conditions directly, with no :conditions hash. This will of course be fixed in Rails 3 with the new abstract query layer whose name I forget — Arel. Yes, thank you. Arel is wonderful; it removes this problem entirely.

Lastly, we're going to talk about a very MySQL-specific thing. Again, sorry to anybody who doesn't use MySQL, but that's where most of our problems come from, really. The problem with indexes — and really, indexes are a wonderful thing; I shouldn't say there's a problem with them, because if we didn't have them we'd have a lot more problems. But indexes, or the lack thereof, are the source of many big problems on large websites. Your database performance depends critically on having the correct indexes, and a lot of the time, especially for people who are just getting started with Rails, they don't understand what to index and when. The other problem is that as your database grows, it gets harder and harder to add new indexes. So it's important to think about what indexes you might need early on, when you've only got a few thousand rows in your table, before it gets Slashdotted or Dugg or hits Hacker News and now you have a problem. At Scribd, we simply can't add indexes to certain tables anymore — we either add them on slaves, or we find creative ways to use existing indexes. All sorts of things that maybe I shouldn't even be telling you about.

So here's some pretty straightforward code to load a user's documents for their profile page — if only it actually looked this simple at Scribd. We just have a bunch of named_scopes: Document.of_user — we could also do user.documents, but whatever — paginate with params[:page], and order by upload date. That's fine. And of course, since we're smart, we take a look at the query, and there's nothing wrong with it. It does exactly what we thought; it just strings all of those scopes together.
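Roughly, the code in question looks like this — a simplified sketch, since Scribd's real version is hairier; the scope names, column names, and the will_paginate call are my approximations.

    class Document < ActiveRecord::Base
      named_scope :of_user, lambda { |user| { :conditions => { :user_id => user.id } } }
      named_scope :by_upload_date, :order => "uploaded_at DESC"
    end

    # On the profile page:
    @documents = Document.of_user(@user).by_upload_date.paginate(:page => params[:page])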
Anyway, down the pipeline comes the requirement that documents should be paranoid-deleted. For those of you who don't know, paranoid deletion means you don't actually remove anything from the database; you just set a deleted boolean to true, so you still have the data in case somebody wants to undelete it later, or for whatever reason. Or, the way we do it at Scribd and the way the acts_as_paranoid plugin does it, it's a deleted_at timestamp: if it's null, the object is not deleted; otherwise it's the time it was deleted.

So we all know how to fix this: we just add a not_deleted named_scope, right? That's easy. Well, you push it to production, and Scribd goes down. So what went wrong? Let's take a look at the query. Nothing wrong here — it's exactly what it should be; it just adds AND deleted_at IS NULL. That's fine. So let's delve deeper, into the indexes. If we SHOW CREATE TABLE here, you can see the problem: we have an index on user_id, and we have an index on deleted_at, but we don't have an index on user_id and deleted_at together, and MySQL is not smart enough to somehow combine those two. Importantly, MySQL only ever uses exactly one index per table for each query, so it's going to pick one. If we EXPLAIN, MySQL confirms it: we've got some scary-looking stuff in the Extra column — "Using filesort" should scare people — and the fact that we're only using index_documents_on_user_id is also pretty scary; it effectively has to scan and sort everything that user has.

The solution, of course, is to add an index on user_id and deleted_at. You run the migration, and your problem is solved — you don't even have to change your code. And if you're smart and run the EXPLAIN before you push to production, you'll see it is in fact using the correct index now. I actually forgot to change this slide — it still has all that scary stuff at the bottom, but whatever, this is just a fake EXPLAIN. It says index_documents_on_user_id_and_deleted_at, which is good; that's what we want. We're done.

But what about the just-in-time solution? Adding that index is probably not tenable for a large site. Our documents table right now is millions and millions of rows, and if we were to add an index, it would lock that table for a very long time — not to mention that the index has to propagate to all the slaves. It's a big mess. You can at least force MySQL to use the more restrictive of the two existing indexes, which should fix some of your problems for some time, until you can find a better way to deal with it. Here we're using the USE INDEX feature of MySQL, which forces it to use a particular index. In general, I'd say MySQL is reasonably intelligent about picking the correct index — yeah, I see some shaking heads; it can go really, really wrong at times — and if you, as an intelligent programmer, don't want to leave it to chance, you can use the USE INDEX or FORCE INDEX hints to ensure you're using the best index available. In this particular instance, we're assuming that each individual user has fewer documents than the set of all not-deleted documents, which is a reasonable bet to make. That way we narrow things down quickly by user_id, and then all we have to do is iterate over that user's documents, in O(n) time, to filter out the ones that are deleted. So as long as you're not loading some giant monster user's page, it should be at least reasonably fast for the time being.
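The hint itself looks roughly like this; in Rails 2.3 one way to get it in is to drop down to find_by_sql. This is a sketch — the index and column names come from the example above, and I've swapped the pagination for a plain LIMIT for brevity.

    # Force MySQL to narrow by user first, then filter deleted rows in order.
    Document.find_by_sql([<<-SQL, user.id])
      SELECT * FROM documents USE INDEX (index_documents_on_user_id)
      WHERE user_id = ? AND deleted_at IS NULL
      ORDER BY uploaded_at DESC
      LIMIT 10
    SQL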
So, normally MySQL will pick decent indexes if you only have a few indexes on your table and it's pretty obvious which one to use for a given situation, but as you start adding more indexes, you need to explain your queries and make sure it's using the right one, and you need to provide index hints like this when necessary. There's unfortunately no easy way to do this with ActiveRecord prettiness right now; sometimes you just have to say goodbye to ActiveRecord prettiness.

So, in conclusion, I'm going to reiterate my original points. Always understand the queries your code is generating; unless the query is trivially obvious, you should be thinking about the implications it will have at scale, and explaining it if necessary. Test with a heavily populated database — again, I don't always do this, but I really should, and so should you, at least a couple of times. dm-sweatshop is nice for this; hopefully there's something like it for Rails that I just haven't discovered yet. Prop up giant databases with random data, test with them, and look for frustrating slowness. And lastly, pay close attention to your indexes, because indexes are very important, and oftentimes new programmers just don't understand them. So that's the conclusion of my talk. Are there any questions?

Have you done any experimenting with Postgres with some of the same techniques?

Yeah, actually — Scribd runs on MySQL, but I use Postgres for all of my own little websites, and I run into some problems with Postgres too. The functional indexes are nicer. As for some of the nittier, grittier stuff, I only really understand that in MySQL, just because I don't run a website like Scribd on Postgres; that's just the way it is. But there are lots of experts besides me, I'm sure, who would be very helpful. Any more?

Have you looked at trying to — I mean, there's a lot of what they call NoSQL database technology around now, and a lot of what we're doing in Rails is using the database to model references. When you have a million-row table joined to another million-row table and you need a bunch of references, that gets really huge, and it's hard to do in a relational technology. So have you looked at modeling some of this in a non-relational technology — whether that would be an advantage, or just a lot of hassle?

So, NoSQL is getting pretty big lately, especially with technologies from Facebook and Twitter, who have kind of been paving the way there, and that's good. I agree with you, and probably with a lot of people here, that relational databases have a specific use case they're really good at, and they've just been stretched to model an entire website's data structure. Unfortunately, at Scribd we haven't delved into that a lot. Tyler is going to be speaking next, and he knows more about this than I do — he's more on the cutting edge of this stuff than I am. But we still use MySQL for almost all of our core data. That may change in the future, hopefully, but not yet, unfortunately. Let me go.