So my name is Tammer Saleh and I work for Thoughtbot over in Boston, and today we're going to talk a little bit about coding for failure: how to build rock solid applications, in 45 minutes, not 60. All right, so the talk is structured in pretty much three main sections. The first two are a little more philosophical, a little more energetic. The last one is going to be a lot more code.
So the first thing we're going to talk about is why you want to build rock solid applications. Just bear with me on that. The important part is that it's for the people. It's for your clients, the people who are going to be using your applications. You want to make them happy, and there's a lot to be said about that, especially in the book Defensive Design for the Web by 37signals. A lot of that is the philosophy that went into the Ruby on Rails framework: how easy it is to add good user feedback through things like flash messages, error pages, validations. Failure emails are a great example of something that people don't do. Tripit.com does. If you're sending an email to an automated system and it can't parse it, it sends you back an email explaining what went wrong. Little details like this. Also, XML responses are something that's often ignored. If you try to post XML to a site incorrectly, often you don't get back the errors that you should. So those are the kinds of details that we're going to talk about today: how to make your apps deal better with those kinds of situations.
I want to talk a little bit about error pages real quick. Here's an example of a great error page, the fail whale from Twitter. This is not just a 500 page, this is actually a capacity issue page, right? So they had to take great pains to create a special error page just for one of their major problems, right? Now this is a liability, having capacity issues. Twitter's known for the scaling issues. And yet people love the fail whale, right? People really love the fail whale.
And this got Twitter all kinds of publicity. They took something that was a distinct liability and, through good user interface, through treating their clients well, they turned it around into something that worked for them.
The other reason that you want to build fault tolerant applications is for yourself. You want to be able to sleep at night. A good example of this: I used to be a systems administrator over at Citysearch, and we used Network Appliance filers. This is essentially a big disk array, insanely expensive, but it's expensive for a reason. When a disk failed on a NetApp filer, it actually notified the company and sent a disk in the mail before I woke up. I would wake up to a disk failure message and a message that says we've sent your disk in the mail, right? That's really good service for the people who are buying NetApp. That's why they could charge so much money. It's because they think about those situations, they think about the right way to solve them.
So the next thing we're going to talk about is when you want to deal with these issues. It's not always clear. The most important part is that you want to think about the failure points in your application. You're not going to go around fixing everything. You have to know what parts to concentrate on. Some of the failure points that you'll see are external services. These are the most obvious: your email server, or a web service that you communicate with, your GeoIP, whatever. When you're communicating with something like that, that's a distinct failure point, something that you have to code around. You have to be very careful when you're dealing with those. Another one is internal services. If you're running Memcached or a full-text search engine, part of the reason that we shy away from running those early on in a project is because that adds another point of failure that you have to be very careful with.
Of course, things like your database and your file system are also internal services that can fail, that you have to be aware of. Complex operations, complex systems are something you also have to worry about. If there's something you're interacting with, like Git, for example: it's a very complex system, and it's somewhat external to your application. When you're working with Git or with the command line or with the OS, you have to be very aware of what you're doing. Performance bottlenecks: like Twitter, for example, if there's a place where there's a known performance bottleneck, you have to be aware of that in order to give the best service to your clients. And entry points. If you have one spot in your application where lots of people are entering, with web apps it would be the front page; with services it might be the create action on a certain model, something like that. You want to surround that with defense. You want to make sure that that's going to work correctly in all situations.
But the most important thing to know is when to worry and when to relax. You can't spend all of your time fixing everything in your app, making sure that no situation goes uncaught. The important thing to know is that you're not NASA. You're a web company. Come on. How important is this in certain situations, right? A great example of when you can relax, and when you should relax, is in the destroy action on a Rails controller. If you look at the scaffolding that's produced for those, you'll see that the destroy action isn't wrapped in an if block. It's not catching any errors that come out of that. It's an implicit assumption that when you call project.destroy, everything's going to be fine. There are all kinds of things that can go wrong in the destroy method on an ActiveRecord model. You can have database issues. There might even be validations you didn't know about, something like that.
But the chance of that happening is very slim, and the impact to a user is also fairly slim. And that's kind of the important thing to know: it's always a trade-off. You can't surround everything with all this error catching and all these what-if situations. You have to be very careful about choosing the right ones and working on those. It's a trade-off of effort versus impact. I had a designer make that slide for me.
All right, now we're going to get into the meat of things a little bit more and talk some about code. There are three guidelines that I like to give for building an application that handles failure well. The first one is fail fast. How many people here have heard of that phrase? Great, great. About half of you. Jim Shore wrote an essay about fail fast, and he said: think about what kinds of defects are possible and how they occur. Place your assertions so that the software fails earlier, close to the original problem, making the problem easy to find. The idea behind fail fast is that you catch bugs as soon as they happen. And the way you do that is you surround things with sanity checks. You don't want your application to try and stumble along. You don't want it limping. If there's anything wrong, you just want it to die. You know about it as soon as possible, you go in and fix it, and you start it back up again.
Here's an example of a sanity check that we use in one of our applications. It was dealing with an external API, and it was for purchasing. The API was known to go down for long periods of time, something we had no control over. And it was a really bad user experience to have to go through this whole purchase wizard before they found out at the very end that the purchase wouldn't actually happen because the API was down. So we put in this service available method, which essentially pinged the API and made sure that it returned sane results.
And if it didn't, it notified us both by logging and by using Hoptoad, which is something I'm going to talk about a little later on.
Another great example of sanity checks is Subversion. How many of you guys have ever typed this command? svn commit -m, which is for the message that's going to be in the commit. And then you forgot to actually give the message. You just gave a path to a file. Has anybody here ever done that? Yeah, about five of you. When you do that, Subversion says: this commit message looks like a file path to me. I see a file that matches the path for this commit message. I'm not going to commit this. And it gives you a nice little error message saying, you know, please change your commit message, or did you mistype this? That's a great sanity check that really improves the UI, because the alternative is pretty bad for Subversion: you would commit a whole bunch of files except one, and your commit message wouldn't make any sense. You wouldn't know anything went wrong until you did svn stat next time. So that's great attention to detail for a common failure, right?
The second guideline is to fail loud. The worst thing that I see in code is something like this. In some task, somebody is looping through a bunch of records, modifying the records, and then just calling save on each one. They're thinking to themselves: well, it might fail, but I don't know. I don't really want to deal with that. I don't know what to do if it does fail. Just shh. I mean, the simplest way to solve this is to add a bang, right? save with a bang will now raise. On the first failure it comes to, it will stop the loop and raise an error. That's an okay solution. It's much better than failing silently. The problem with that, of course, is that now you've got half of the records that have been updated, the other half haven't, and you've got a pretty bad error message presented to the user.
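The slide code for this loop is not in the transcript, so the following is only a minimal sketch of the idea in plain Ruby. FakeRecord and RecordInvalid are hypothetical stand-ins for an ActiveRecord model and its error class; the only thing borrowed from Rails is the contract that save returns false on failure while save! raises. The third method is the collect-every-failure variant that the talk weighs next.

```ruby
# Hypothetical stand-ins for an ActiveRecord model and its error class.
class RecordInvalid < StandardError; end

class FakeRecord
  attr_reader :name, :saved

  def initialize(name, valid)
    @name = name
    @valid = valid
    @saved = false
  end

  # Quiet version: reports failure only through its return value.
  def save
    @saved = @valid
  end

  # Loud version: raises instead of returning false.
  def save!
    save || raise(RecordInvalid, "could not save #{name}")
  end
end

# 1. Fails silently: the return value of save is thrown away.
def update_quietly(records)
  records.each { |record| record.save }
end

# 2. Fails loud: stops at the first bad record and raises.
def update_loudly(records)
  records.each { |record| record.save! }
end

# 3. Keeps going and collects every failure so they can be shown together.
def update_and_collect(records)
  errors = []
  records.each do |record|
    errors << "could not save #{record.name}" unless record.save
  end
  errors
end
```

With the second version, any records after the failing one are left untouched, which is the half-updated state described above.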
Of course, if you want to take it to the next level, where you're actually going to loop through all of them and save the error messages and present that to the user in a friendly manner, it's a bunch of code. I mean, it's a pain at this point. And this is what I'm talking about with the trade-offs you have to make. This is a lot of effort, and more importantly, it's a lot of code, which is a liability. You want to minimize the amount of code in your application, but you want to give a good user interface at the same time, right? So, depending on the application, this might be what you want to do. If it's something simpler, something with less of an impact, then the previous slide works just fine.
One of the ways of failing loud, of course, is to log things. It's a very simple method. It doesn't send anything to the programmers, but it keeps track of what issues have happened, right? It's useful. And because we build mostly Ruby on Rails applications, the Rails logger is very accessible. In any application, there's a Ruby logging library. Very easy. You can also send out an email. Sending out an email, unfortunately, is so complicated that I didn't want to put any code in there. And then, of course, you can page the sysadmin. I don't even know how to do that. I wasn't going to try and put code on that one.
And then you can use Exception Notifier, which takes care of sending the email for you on any exception in a production Rails application. The problem with that is that you get information overload. How many people here use Exception Notifier in their applications? How many people have ever had an application that's in the wild, with a lot of hits, and you've got one error that just slams your inbox? How many people have been able to work while that's happening? Yeah, that hit us pretty hard. So actually, because of that exact situation that happened to us in one of our larger applications, we realized we needed to come up with another solution for it.
This is not a talk about Hoptoad, but I've got to mention it because it's very relevant. How many people here have used the Hoptoad service? All right, all right, not bad. Essentially, it replaces Exception Notifier. Instead of sending out an email, your exceptions turn into, you know, just a post to our Hoptoad service, and then we keep track of the exceptions that have happened. We send out an email for the first one, and then for subsequent exceptions, you just see the count, like that one, that was pretty big. And it just correlates it for you, multi-user, things like that. You can also trigger exceptions on your own to send to Hoptoad, which is a very useful feature. So you can rescue something, still present a very good UI to the user, and yet still be notified via email that that issue happened, right? You can also send arbitrary information to Hoptoad. If you're doing something that's not in a Rails app, if it's just a Ruby application, you can still use this just by constructing your own message.
So the third guideline is zero trust. Components should never trust their inputs, and components should always do their best to provide sane outputs. Now this is something you don't see very often in Rails applications, or Ruby applications, but it's a very good coding practice that was kind of made famous by Dr. Bernstein, I believe his name is, DJB, who did qmail. So this is only about a quarter of the diagram here, but this is the security architecture of qmail. Later on, on the last slide, I've got references with the URLs for anything that you guys want to look up afterwards. Essentially each one of these is a separate process, and they take input on standard input and they send it out on standard out, and they are very paranoid about their inputs.
Tons of sanity checks ensure that data that flows through the qmail system is checked again and again and again, so that there are no security problems, and there's very little crashing and stuff like that. It's a very good architecture.
All right, let's see some actual code here. For external services, something that we see often that isn't treated well enough is Net::HTTP exceptions. How many people here know how many exceptions you have to catch when you're doing a post with Net::HTTP? That's a leading question, nobody does. This, unfortunately, is the block that you have to use when you're dealing with Net::HTTP. There's, like, three exceptions that actually descend from the Net module, but then there's also a couple like a timeout error. I don't even know what these are: connection resets, EOF error. There's a lot of them that come from the underlying Ruby code that Net::HTTP is built on, and if you don't rescue all of these, users will see a 500, so you have to be able to rescue all of them.
Same thing with SMTP, which is actually a little bit more tricky. When you're sending an email, there's actually two different types of exceptions that can happen. There's server exceptions and client exceptions. Server exceptions, of course, are like site timeouts, things like that. Client exceptions are more about email addresses being malformed, that sort of thing. Depending on the situation, you might want to catch both of them: you might want to notify the user that their email address was bad, or you might want to say that the site was down. This is the technique we usually use. We've got the constants that I just showed you, and then when we send a message, we have two rescue clauses, one for each. Note the use of the splat before the constant, so it's actually rescuing all of those. And then, you know, you present the best UI for that situation to the user. Of course, there are some situations, like in a rake task, when there is no UI that you can present to the user.
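The constants on the slide are not captured in the transcript. As a rough sketch of the constant-plus-splat technique described above, the list below is one commonly cited set of errors a Net::HTTP call can raise; treat its exact contents as an assumption rather than the slide's. The helper with_http_errors_handled is likewise a hypothetical name used only for illustration.

```ruby
require 'net/http'
require 'timeout'

# Assumed list for illustration: three errors from the Net namespace plus
# lower-level ones raised by the code Net::HTTP is built on.
HTTP_ERRORS = [
  Timeout::Error,
  Errno::EINVAL,
  Errno::ECONNRESET,
  EOFError,
  Net::HTTPBadResponse,
  Net::HTTPHeaderSyntaxError,
  Net::ProtocolError
].freeze

# Hypothetical helper: runs the block that talks to the remote service and
# turns any listed failure into a value the caller can present to the user.
# Errors that are not in the list still propagate.
def with_http_errors_handled
  yield
rescue *HTTP_ERRORS => e
  { ok: false, error: "#{e.class}: #{e.message}" }
end
```

The splat expands the array into the rescue clause, so the one clause covers every class in the list. The SMTP case would follow the same shape, with one list for server-side errors and one for client-side errors, each with its own rescue clause.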
So no matter what, you're pretty much just going to be rescuing all of them. And here's another example of when you want to fail loud: send a message to your developers so they know that the rake task failed.
The command line is another point of failure that we talked about. The important thing here is to always check response codes. There's just a myriad of things that can go wrong when you're dealing with UNIX commands. So it's important to either check the response codes, or look at the command output to see if there are any abnormalities. My favorite is this one. So in my sysadmin days, I took over a network that had been built by somebody who was not a sysadmin. So that being said, I knew I was in trouble when I looked at some of the cron jobs and found lines like this sprinkled across the production machines. I said to myself, if I ever remove or rename a directory, I just have to accept the fact that at midnight that user's home directory might get wiped out. Of course, the correct way of doing this, instead of using the semicolon, is to use a double ampersand. That's the simplest way of getting around that. The next command won't run unless the first one passes. In your Ruby code, the correct way of checking the response code is to call success? on $?. It's the same kind of issue. You've got a UNIX command that you're running, you check the success, and you deal with it. Anytime you run a UNIX command, with system, backticks, whatever, you need to check the response.
Another place where we often see issues is in cron jobs. So the important part about the actual crontab itself is that you want to keep it as small as possible. Cron is not something you want to deal with if you can avoid it. It's something you want to minimize your interaction with. So we set up the MAILTO variable, so if there is any problem on this line, if cron is actually not able to execute that script, somebody will get emailed about it.
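Going back to the exit-status point from the command-line discussion above, here is a minimal sketch of checking $?.success? after running an external command. The run! helper and the CommandFailed class are hypothetical names for illustration; Kernel#system, $? and Process::Status#success? are standard Ruby.

```ruby
require 'rbconfig'

# Hypothetical error class raised when a command does not exit cleanly.
class CommandFailed < StandardError; end

# Hypothetical helper: runs a command and raises if it exits non-zero,
# instead of carrying on as though it had worked.
def run!(*command)
  system(*command)
  return true if $?.success?

  raise CommandFailed, "#{command.join(' ')} exited with status #{$?.exitstatus}"
end

# Example: run the current Ruby interpreter so the sketch needs no other tools.
run!(RbConfig.ruby, "-e", "exit 0")
```

The same check applies after backticks: $? is set there too, so the output can be captured first and the status inspected before the output is trusted.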
And then all the nightly jobs we put in a cron nightly script. And this is the cron nightly script that we use, or an example of it. The first thing that you want to do when you're running something out of cron: you need to realize that cron doesn't pull in any of your environment. It's entirely a blank slate. If you want your environment, then you have to pull it in yourself. So we source our bash profile for that user. And then we make sure that we're in a directory that makes sense for us. The second thing is that, especially when you're dealing with, for example, a Rails application that's running in a cluster, there are some tasks that really should only run on one machine in that cluster. I'm sorry, some tasks happen on all servers. So those are just normal script/runner calls. They're in every crontab across every server. Some of them only happen on one machine in the cluster. So the way that we deal with that is we have just a file that we touch called primary. We just put that in the user's home directory on the one server that should be running those one-off commands. And then if that file is there, the cron job runs them.
An important thing, like we said: fail loudly. We log everything that happens in the crontab. And again, failing loudly, we use Hoptoad to notify on every error that happens. Now, this is another good example of how to build this sort of thing in a testable way, in a maintainable way. We put all of our logic, all of our business logic, inside a method on a Ruby module. That allows you to have it inside your actual code base and not as part of just some script that cron is running.
I want to talk a little bit about exceptions now. How many people have seen this? And how many people actually do this? All right, that's very good. So this is called the inline rescue. And what it does is it catches, quote unquote, any exception and returns nil instead.
So if foo.bar is nil, and baz is called on nil, then it'll just set x to nil. Usually that would result in a whiny nil exception. This is from Programming Ruby: a rescue clause with no parameter is treated as if it had a parameter of StandardError. This means that some lower level exceptions will not be caught by a parameterless rescue clause. If you want to rescue every exception, use rescue Exception => e.
So this is the Ruby exception hierarchy. You've got Exception, the base class of all of them. You've got NoMemoryError, ScriptError, SignalException, which is for interrupts, SystemExit, SystemStackError. This is the stuff that's actually rescued when you do just rescue without specifying the actual exception. So that only rescues about half the exceptions that can actually happen. This will rescue all of them. In a Rails application, for example, if you do that, they will not see the 500 page, they will just see a "something went wrong with this application" page. So it's very important whenever you rescue, not only to not use the inline rescue, because it promotes bad coding habits, but also not to use any rescues that don't explicitly say what exception you want to catch.
And finally, exceptions should be exceptional. Exceptions are slow, and they cut through object-oriented principles. But more importantly, exceptions say to other programmers who are working on your application that this is an error case. If you start using exceptions to replace branching and conditionals, if you start using exceptions as part of the business logic, that's just very confusing to the other programmers.
Well, I finished a little bit early. We've got about 10 more minutes here. So like I said, I'm with Thoughtbot. My name is Tammer. This is actually going to be part of what we're writing about in a book that we're going to be releasing on Safari for Addison-Wesley in a couple of weeks, called Rails AntiPatterns.
I'm writing that with Chad Pytel. So it's going to be in the beta books in just a couple of weeks. And if anybody wants to look up some of the references that we used, they're all on this slide. So does anybody have any questions? All right, I've got a couple. You?
You had a section which went through users.each and was doing a save on each user. You mentioned that if an exception happened, you'd have some users saved but not others. What is the best practice in that situation?
In that situation, so what he said was, I had the slide earlier that went through a list of users, changed each record, and called save on it with the bang. And I was saying the problem with that is that it stops halfway through the list of users, and some of them are saved, some of them are not. And you can wrap that in a transaction so that the whole thing happens or the whole thing is rolled back. I definitely would do that in that situation. Personally, I would rather be able to display to the end user all of the problems that happened, if possible, so they can address them all before they try that action again. So the third code example is, depending on the situation, what I would go with. But yes, you can wrap the whole thing in a transaction, usually pushing it down into the model layer in some sort of callback.
Somebody back there had a question? Yeah? Yeah, so the question was: in the cron nightly file, every line was redirecting standard out and standard error to a log file, and he was asking why I didn't do that redirect in the crontab itself. That is a good idea, and I should have done that. I still want to keep the crontab itself as small as possible, and that comes from my sysadmin days, having seen so many problems with a malformed crontab where it just doesn't do exactly what you think it's going to do. Yeah?
I was just curious, is there any particular reason, in that nightly cron job, when you're doing the kind of initialization at the beginning, the shebang, loading the profile, whatever, that you're not invoking bash with the -e option? It seems like that would kind of fall into what you're talking about, which is fail early. So if any one of those lines failed, it would immediately kick out.
Right, that's a good point. In bash scripts, if you have the shebang line with bash -e, any command that returns a non-zero exit code will stop the entire script, and that falls in with fail early. Unfortunately, that doesn't also fall into fail loudly, I don't think. Does it print out an error message at that point?
Cron will actually generally email you if any output is sent to standard out.
Okay. You could kind of put the two together. Yeah, it's usually a little work for that to happen, but definitely that falls in with fail early. Any other questions out there?
I've had problems with rescue Exception. People, especially library writers, are doing it thinking that they're being helpful, just because they're making sure that anything that happens in their block is rescued. But I've found that it makes their libraries actually not suitable for use in programs that do any signal handling, because they're catching the signals, and they're also catching SystemExit, so you can't actually kill the threads. So I have to audit all the libraries that I use to make sure that they're not actually rescuing Exception. So it's important to know when it's appropriate to use it.
Yeah, that's exactly true. It's important to know when you should be rescuing. Well, more importantly, you should just be very aware of whatever you're going to be rescuing.
And I think for library writers, for example with the HTTP exceptions that I listed out, I can't tell you how many times I've wished that the person who wrote that had thought about the sub-exceptions that were common, like timeout, and rescued those and then raised a master exception, or just defined an exception that included all those modules. That would have done the same thing. But yes, his point was that library writers who rescue all exceptions across the entire exception hierarchy can cause problems if you're dealing with signals or if you're trying to kill threads, things like that. It's going to rescue that when you don't want it to.
Were there any other questions? I see one over here. Nope. All right, well, thank you very much.
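As a footnote to the rescue discussion in the talk and the Q&A, here is a minimal sketch of how the three forms of rescue behave. The method names are hypothetical and exist only to make the behaviour easy to compare; the semantics shown (a bare rescue means rescue StandardError, while rescue Exception also catches SystemExit and Interrupt) are standard Ruby.

```ruby
# Inline rescue: any StandardError raised by the expression becomes nil,
# including the NoMethodError from calling baz on nil.
def inline_rescue(foo)
  foo.bar.baz rescue nil
end

# Bare rescue: equivalent to `rescue StandardError`.
# SystemExit, Interrupt and other non-StandardError exceptions pass through.
def with_bare_rescue
  yield
  :completed
rescue
  :rescued
end

# rescue Exception: catches everything, including the exceptions used for
# exiting the process and for signals such as Ctrl-C.
def with_rescue_exception
  yield
  :completed
rescue Exception
  :rescued
end
```

Calling with_rescue_exception { exit } returns normally instead of exiting, which is the behaviour the questioner was warning about in libraries.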