 I guess we'll go ahead and get started. I see a few last people trickling in, but this is a pretty full room. I'm excited. Hey there, everybody. Welcome to this talk about weird things that happen in production. My name is Ryan Laughlin, or Rofreg, if you know me from the internet. I am one of the co-founders of Splitwise, which is an app to split expenses with other people. And right out of the gate, I want to say I'm really excited to be giving this talk. This is actually not only my first RailsConf, it's the first time I've ever given a conference talk. And thank you. In particular, I'm really excited to talk about what I think is an important gap in the way that we think about testing and debugging our applications, both in the Rails community and beyond. If you want to follow this talk at your own pace, or if you want to look back at it later, all the slides and the presenter notes I'm reading are up at rofreg.com. And so without further ado, let's get right into it. Let's say you're building a new feature for your app. You plan out all the details with your team about how the feature should work. You think through all the edge cases and the possible issues that you might encounter. And then you sit down to actually start writing the code. Maybe you write tests to help make sure that the feature works as intended. Maybe you do code review so that your fellow developers can help you spot potential bugs and fix them. Maybe you have a staging server or a formal QA process to help you catch bugs before things go out the door. So you take your time and you fix every single bug that you can find. And now it's time to actually deploy your code to production for the whole world to see. And congrats, you're done. You shipped it. Everything's great. Probably. I mean, maybe you're done. I don't know about you, but I am a person who makes mistakes. And quite often when I make an update to an app, I will miss some minor bug or another, and I'll end up deploying that bug to production. I've worked on the same Rails app for seven years, and in that time I've probably shipped several hundred bugs. And so this is a question that I ask myself a lot: if my code has bugs in it, how would I even know? And I wanna be specific here. Note that I'm not asking how we prevent bugs from happening in the first place. That is not what I'm asking. What I'm asking is how we detect those bugs when they happen. Because we have to expect bugs to happen in production. We should expect to make mistakes, and we should expect to make them in production. A quick show of hands: raise your hand if you've ever deployed something. Okay, raise your hand if you've ever deployed a bug. Everybody, right? This is a thing that happens. All the best engineers that I know have made these kinds of mistakes, and that's not something to be afraid of. That's not something to be ashamed of. It's part of being an engineer. Making a mistake is a chance to learn and to grow by figuring out what went wrong, and your systems need to accommodate that. Now you might think, this is what tests are for. This is why we started doing tests as an industry in a more concentrated way over the last 10 to 15 years. Tests catch bugs so that we can fix those bugs before we ship them. And testing is a super important part of this process.
Tests are really good at ensuring that our code generally works as expected, and they're really good at protecting our code against regressions when we're making updates to existing code. But tests do not catch everything. And in fact, I just wanna point out, it's sort of tautologically impossible for tests to catch everything, because we're the ones who write the tests. And most of the tests that we write are not exhaustive. They test a handful of cases. And so if there's an important edge case that we didn't think about in advance, then there may not be any test for that edge case. Now, you can improve your chances by including other people in this process, whether it's via code review or quality assurance. Other people can help you spot issues and problems that you might have missed by yourself. And again, this is a really important part of development, particularly in my experience. Two heads are always better than one. You're going to have better results if more people look at something before it goes out the door. But it has the same problem, which is that even a room full of very smart people are occasionally going to miss something. Especially because it's hard to hold an entire system in your head, and to think about all the different parts of your app and how they might interact with each other and how they might interact with this new piece of code. That brings me to my second idea, which is that your production environment is unique. Your production environment is different from your test environment, and it's different from your development environment, and it's different from your staging environment. And that means that you may have bugs that are unique to production that you do not see anywhere else. Let me give you one quick example. If your app uses a database, which most apps do, I bet that most or all of your tests assume that the database is empty at the start of the test, with no preexisting data. That is not what your app experiences in production. In production, you're working with months or years of preexisting data, and that can lead to edge cases that you might completely overlook in your test environment, because your test environment doesn't have that data. And that's just one way that these two environments differ. There are always going to be differences between your local environment and your production environment, no matter how much effort you put into making them the same. So if we know we're gonna have bugs, and we know that production is a unique environment, then it's pretty logical that we should be on the lookout for bugs that happen specifically in production, and that means that we need to monitor our production environment. There are a few existing tools for doing this, but they're not perfect, and I think they're a little bit incomplete. For a lot of apps, the first line of defense here is exception reporting, and these are things like Rollbar, or Sentry, or Airbrake. You can use the standalone exception_notification gem if you don't wanna use a third party. These are tools that help you by sending you an alert anytime an unexpected exception bubbles up in your app. And this is really great, right? If my app explodes in some unexpected way, I wanna know. But there are really big weaknesses to exception reporting too. First of all, exception reporting can be really noisy, especially if you're running a big app. At scale, you will get a lot of errors that are not your fault.
People will submit requests with invalid string encodings or dates that don't exist. People will scan your app for vulnerabilities and submit tons of garbage data. Just lots of really odd stuff happens when you're a real app in the real world. And while you can tune your exception reporting to screen out a lot of these false alarms, in my experience there will always be new and exciting exception types caused by really odd, unimportant user behavior. And a consequence of this is that because there are so many unimportant alerts, the signal-to-noise ratio is really low. When you have one critical exception in the middle of 20 false alarms, it's actually pretty easy to overlook it. It's like the boy who cried wolf, right? When something serious happens and you need to pay attention to it, you might not be paying attention, because you've filtered out in your brain the 20 other false alarms that came beforehand. Also, very importantly, exception reporting can only catch exceptions. So if you're only looking for exceptions, there are entire categories of bugs that you might miss, where the code does run without crashing, but it returns the wrong result. Here's a very simple example: making a typo in string interpolation, using parentheses around a name, like "Hello, #(name)", instead of curly brackets, like "Hello, #{name}". This returns a result, just the literal text instead of the interpolated name. It won't trigger any exception reporting, but it is very clearly doing the wrong thing. And this is a contrived example, but it's surprisingly easy for this kind of issue to slip by in production. So what else do we have besides exception reporting? Well, we have bug reports. This is often the last line of defense in real production apps: reports that come directly from your users. If you break something hard enough, your users will tell you about it. But there are big problems with this approach too, for kind of obvious reasons. First of all, it's a really bad experience. Bugs make people frustrated and angry and confused, and they make people lose trust in the thing that you've built. Nobody likes using buggy software. It's a bad experience. Second of all, a lot of people won't bother to report issues. It takes time to write somebody an email. If I see an obvious problem with your app or your website, nine times out of 10 I'm just gonna leave your site. I'm not necessarily gonna spend the time to write you a nice long bug report with full repro steps, especially if I'm a non-technical person. It's a lot to ask of your users, and not all of your users are going to do it. And finally, users can only report the problems that they actually see. If you have a bug in an internal system or a background job or something like that, it's very possible that no one will notice for quite a long time, and that the bug could cause a lot of damage before anyone is even aware of its existence. So if something wasn't caught by tests, and it wasn't caught by QA, and it wasn't caught by exception reporting, and it wasn't caught by a user's bug report, how the heck are we supposed to know about it at all? How can we catch silent bugs? And the very simple answer to that is that you can't. You can't fix something that you don't know about. And so the question is not: how do we catch silent bugs? The question is: how do we turn silent bugs into noisy bugs? How do we make the low-level problems that we don't know about actually make noise? We need a system that makes noise.
We need a system that tells us when something unexpected happens, so that we can investigate what went wrong. And we've gotten pretty good at this in development, actually. This is where test suites really shine, because when you make a change to your app and suddenly a dozen tests fail, you know that something unexpected has gone wrong and you know that you need to look into it further so that you can fix it. So what would be really useful is something that's like a test suite, but that's focused on production. Something that doesn't test specific edge cases, but monitors your app for the existence of issues in general. And that's where checkups come in. Checkups are tests for production. The same way that a test suite tells you when something is broken in development, a checkup suite tells you when something has broken in production. Let me walk you through this. First of all, to write a checkup, we need to declare some expectations about how our app is supposed to behave when everything's going well. So for example, in my app, I expect every user to have a valid email address. If I were to write a checkup, it would be a block of code that helps me verify this. Does every user have a valid email address? I don't actually know unless something checks. This checkup then runs on a regular basis, many times per day, checking to see if anything unusual has happened. And this is important in production, right? Because maybe all my users had valid email addresses at 2 p.m., and maybe they all had valid email addresses at 3 p.m., but maybe something happened between three and four. Even if I haven't deployed anything new recently, it's possible that a new bug may have bubbled to the surface for the first time since my last deploy. And a checkup can help you detect when that happens. Finally, if your checkup fails, then you need to be alerted so that you can investigate what happened and fix the underlying bug. Once you get that alert, you can start to figure out what the problem is. And that's the whole idea. It's pretty simple, but it's really, really powerful. Because what checkups do is help you detect symptoms so that you can fix the cause. They are the number one tool that I know of for discovering issues that you didn't even know about. It's a lot like getting a checkup with a doctor in real life, right? You go not expecting to find anything wrong, but if you do find something wrong, detecting that problem early and fixing it before it becomes a bigger issue makes a huge difference and is way less painful. An ounce of prevention is worth a pound of cure. So to illustrate, let me give you a real example that we had at Splitwise a couple of years ago. This is from when we added multiple email support for our users. At Splitwise, we have a user model. Surprise! And for a long time, it was a really simple user model. We made it with AuthLogic way back when. A user had one email address. Really not that complicated. But there came a time when we decided, okay, we should add support for multiple email addresses. That's a good, useful feature. People would take advantage of that. So we made a new email address model, and we added a has_many relationship, so that one user could have many email addresses. And as we polished up this feature and started writing tests, we realized: oh, right, hang on. We need to make sure that all users have at least one email address. That's important. I don't want any users with no email addresses.
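In code, the shape of it was something like this. (This is a minimal sketch, not our actual models; the real ones have a lot more going on.)

```ruby
# A minimal sketch of the two models involved.
class User < ApplicationRecord
  has_many :email_addresses, dependent: :destroy
end

class EmailAddress < ApplicationRecord
  belongs_to :user
  validates :email, presence: true
end
```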
So we added a validation to make sure that every user has at least one email address. And this worked great. Our tests passed, everything was perfect. It's a very straightforward-looking bit of code. Rails does not have a "has at least one" relationship, but this is kind of the most terse way that you can express that. And in fact, I checked before this talk: if you Google "rails has at least one", this is the first Stack Overflow result. This is the standard way to do it if you're stumbling around on the internet trying to figure out what you're supposed to do. And so we wrote a whole bunch of tests to make sure that this worked the way that we intended. If you tried to delete a user's last email address, the validation would not let you continue. So this was well tested. Now, I want you to look at this code for a few seconds, and I want you to think about what might go wrong. And again, let me be specific here. I'm not asking you to actually figure out what the bug is. I'm asking you to think about what might happen if there is a bug. If there is a bug, how will we find out? What is the thing that we will notice in production? Because checkups are really, really good when you have a hunch that something might go wrong. You think your code's fine, but you want some extra insurance, just to make sure that if this does something unexpected, you know about it right away. And this is the same reason that we write tests, right? When I write code, I'm generally pretty confident that I've written it correctly, but tests give me more confidence in my work. Tests sort of double-check to make sure that I haven't missed something. And checkups work the exact same way. So in this case, we thought: hmm, all right, we had that temporary bug where someone ended up with no email addresses until we wrote tests for it. So maybe we should write a checkup for that. Maybe there's still some edge case where a user can end up with zero email addresses. And so we wrote this. This is a checkup. It's very short. It's very simple. First, we fetch all of the users who have recently updated their accounts, and then we iterate through those users and check to see if there are any users with zero email addresses. We run this once per hour. And if we find any users who don't have any email addresses, then this checkup sends an alert to us so that we can start to investigate and figure out what happened. It is literally five lines of code. It is very simple. This isn't some crazy complicated technical magic. It's the kind of thing that you'd write from the Rails console if you wanted to check this yourself. And so we deployed our new feature, and we included this checkup to make sure that we hadn't missed anything. And for the first day or two, everything was great. And after a few days, sure enough, our little checkup sent us an alert. Someone had slipped through the cracks. There was a user who had ended up with zero email addresses, despite our tests, despite our conscious attention, despite all of our best planning. And so we investigated. We looked through our logs for this user and realized: okay, interesting. This user actually used to have two email addresses. They added the second email address. And then they tried to delete both of them at the same time. And we realized: oh, okay, wait. There's a race condition, one that we hadn't anticipated when we were first writing these tests.
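Before we get to the bug itself, here's the rough shape of that checkup. This is a sketch, not our exact production code, and the alert helper is a hypothetical stand-in for whatever notification channel you use:

```ruby
# Checkup: did any recently-updated user end up with zero email addresses?
# `alert` is hypothetical; swap in email, Slack, your exception reporter, etc.
User.where("updated_at > ?", 1.hour.ago).find_each do |user|
  if user.email_addresses.empty?
    alert("User #{user.id} has no email addresses!")
  end
end
```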
See, if you have a user with two email addresses, and they submit two different requests at the same time, and each request is trying to delete a different email address, both of those requests pass validation. In request number one, the user still has one email address left, so Rails thinks it's totally valid. And the same is true in request number two. And because both requests pass validation, both of those email addresses get fully deleted from the database. As a result, you end up with an invalid user with zero email addresses. That's a really obvious bug now that we know it's there, but we had totally missed it on our first pass. Thankfully, because we'd written a checkup, we were able to discover this bug really quickly, the first time that it happened. And because we discovered it so quickly, we could fix it right away. So this is an example of how a few lines of code, running occasionally on a repeated basis, can make a difference. It can help you spot something that otherwise you would have missed, and that otherwise would have started to impact user behavior and the function of your app. In this case, the user with zero email addresses maybe isn't able to log back into their account. That's really bad. Depending on what you do, that can be really, really bad. So how do you write a checkup? How do you apply this? Well, here's the same little code snippet from before. And there are a couple of ways that we can take this code snippet and finish turning it into a fully functional checkup. Method number one is turning it into a rake task. This is actually how we do most of our checkups at Splitwise. It's pretty easy to set up a rake task as a recurring cron job, so that it gets called on a regular, repeating basis. We use Heroku at Splitwise, so we use Heroku Scheduler for this. And that makes it easy to configure a rake task to get called once a day, or once an hour, or once every 10 minutes, so that we can make sure that these checkups continue to pass. Another good option is an after_commit hook or an after_save hook. This is an ActiveRecord callback that executes after your model has been fully written to the database (after the commit, in the case of after_commit). And if you've accidentally written something incorrect to your database, like a user with zero email addresses, this is an excellent place to catch it. I should note that this approach comes at a bit of a cost: you're adding overhead every time that you save an ActiveRecord object. But that said, it gives you immediate feedback about any unusual errors, so it can be a really good option if you're writing a checkup about a mission-critical part of your app in particular. You can also kind of split the difference and perform checkups in background jobs. This is a great way to perform checkups on demand, in response to a specific user action like updating a record or making some other kind of change, but without slowing down the actual request too much. You get to verify just a few seconds later that everything's in order. And if it's not, then you go and respond. You go fix your code. And honestly, that's just a start. Checkups are intentionally a pretty general idea, and there are a lot of places where you can use this concept. We've had places where we write checkups that run inline in controller actions, to make sure that things are actually progressing the way that we expected before the controller action even finishes.
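To make the rake-task flavor concrete, here's one way it might be wired up. The task name and the alerting are assumptions on my part; the point is just that anything a scheduler can invoke will work:

```ruby
# lib/tasks/checkups.rake (a sketch)
namespace :checkups do
  desc "Alert if any recently-updated user has zero email addresses"
  task users_have_email_addresses: :environment do
    User.where("updated_at > ?", 1.hour.ago).find_each do |user|
      if user.email_addresses.empty?
        # Swap in your own alerting: email, Slack, your exception reporter...
        Rails.logger.error("[checkup] User #{user.id} has no email addresses!")
      end
    end
  end
end
```

Then you point Heroku Scheduler, or plain cron, at rake checkups:users_have_email_addresses once an hour, and you have a recurring checkup.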
We've used them in service objects, in order to encapsulate them nicely and to understand exactly what's going on. It's a very broad concept that can be used in a lot of different places. Okay, cool, a different question: when should I write a checkup? What kinds of problems can checkups catch? Because developing an intuition for this is hard. It's kind of like when you started writing tests for the first time. Should you test everything? Maybe, I don't know. As we've already seen, checkups are really good at sniffing out race conditions, and that's one I wanna highlight. I think race conditions are maybe the best example of a problem that is really rare in development and testing, but absolutely common in production. And if you're like me, you probably find thinking about race conditions really hard. My brain is not built to think in parallel threads, but in production, that's what my app faces all the time. It's extremely common not only to see many users trying to use your app at the same time, but to see an individual user submitting multiple simultaneous requests at the same time. And checkups can help you detect when this has caused something really weird to happen. Invalid data is another thing that comes up pretty commonly in production that you don't really see in development. The longer you run an app in production, the more likely you are to accumulate some weird, malformed, improper records in your database, whether that's MySQL or other data stores: caching layers, Redis, Memcached, even static files in S3. As you accumulate stuff, the likelihood that some of it is in a weird format that you didn't expect just keeps growing. So again, let's go back to the zero-email-addresses problem. We found this bug. We wrote some new tests. We wrote a fix, and we deployed it. And the problem was solved. Except that the problem was only solved going forward. If we wanted to solve this problem 100%, we needed to go back and fix the existing invalid records that were now in our database. There were still several users who didn't actually have any email addresses. And just because our test suite passed, and this could no longer happen to new users, didn't mean that the old records had been cleaned up. We had to hand-fix those records before the issue was fully resolved. And again, this is a common pattern, and a difference between development and testing on the one hand and production on the other. In development, on purpose, you generally work with a clean slate. You're encouraged to clean out your database very regularly, and there are a lot of important reasons why that's a good idea. But that's also not how production works. In production, you might have malformed records that were caused by bugs that happened months ago or years ago. Most of the time, that ends up being okay. I think almost every app has a couple of weird bits of data floating around somewhere. But sometimes that malformed data is really important to catch and to fix. And checkups are a really excellent tool for sniffing that out and fixing it.
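As a sketch, the "go back and clean up history" step could look something like this. The query pattern finds users with no associated email addresses at the database level; the fix itself was manual for us:

```ruby
# One-off audit: find *every* user with zero email addresses, across all
# historical data, not just records that changed in the last hour.
orphaned = User.left_joins(:email_addresses).where(email_addresses: { id: nil })

orphaned.find_each do |user|
  # We fixed these by hand; do whatever makes the data valid again.
  Rails.logger.warn("[audit] User #{user.id} still has no email addresses")
end
```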
I also want to call attention to one method in particular. Raise your hand if you've ever used the update_column method in ActiveRecord, or update_columns. Okay, yeah. So you might have invalid data in your database. These methods skip validations, and that's kind of the point, right? There are times when we want to give our validations the runaround. But if you've ever used them, even once from a console, and not in code that you ship as part of your app, there's no guarantee that your data checks out. You might have accidentally made a typo and inserted something weird. Checkups are also a really good tool for when you know that there is a bug, but you have no idea how to fix it yet. You can use a checkup to gather additional diagnostic information about a bug that you don't understand. Say there's a model that's doing something weird one time in a thousand, and we have no idea what the cause is. If we check every time that we make a change to this model, then we can record the types of users and the types of interactions that are causing this to happen. And not only that, depending on what the bug is, we might be able to paper over it in real time. If there's a programmatic way to resolve the issue once it's discovered, then we can write a checkup that not only detects the issue and tells us about it, but actually fixes it, so that the impact in production is minimal. And this can buy you some really important time while you continue to investigate the underlying problem that is causing your bug in the first place. Finally, checkups are really valuable if you're anyone who does some sort of ops work in production. In fact, just to be explicit about it, the whole idea of a checkup is basically a borrowed concept from ops, right? Ops is all about checkups: is the home page up right now? Do we have an email backlog? Checking these things, and wanting to know right now if something is wrong or if it's in an okay state. Checkups are all about evaluating system health in that way. And so, A, if you're someone who does ops, this is useful. But B, it can be useful even if you're not the person in charge of ops for your application, because checkups can alert you to unexpected changes in behavior that might be caused by your code, not by external circumstances. If your app usually processes 1,000 background jobs a day, and it suddenly starts processing 100,000 background jobs a day, that could be a bug in your code. You might have an infinite loop somewhere that's spawning a ton of jobs. You might have deployed something that's accidentally enqueuing way more jobs than you intended. Having an early warning about these sorts of things gives you a chance to say: okay, wait a minute, did we change something? Why is this happening? And as we were talking about in the keynote this morning, Rails lets you run a lot with a little. You can be a small team doing a lot. I know personally, I'm someone who did not have ops experience when I started my job, and Rails sort of made that part of my job. I learned how to run something by myself, and I know there are a lot of small apps where you're wearing a lot of hats, and this is a really valuable thing to have in your toolkit. At Splitwise, we have an entire suite of checkups like this. Like I said, we think of it like a test suite: it's not just "where's the test," it's "where's the checkup." Some of them run daily, some of them run every hour, some of them run every couple of minutes. Sometimes our checkups are exhaustive, which means they check every single record that's been recently updated, because we don't wanna miss a single problem. This is a really good approach if you have something mission-critical and you need to make sure that it's always right. Other times, they're just spot checks.
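For example, an ops-flavored spot check might look something like this sketch. The threshold is made up, and I'm assuming a Sidekiq-style queue here; Sidekiq::Stats#enqueued returns the total count of queued jobs:

```ruby
require "sidekiq/api"

# Spot check: is the background job queue way bigger than normal?
NORMAL_MAX_QUEUE_SIZE = 10_000 # tune this to your app's usual traffic

enqueued = Sidekiq::Stats.new.enqueued
if enqueued > NORMAL_MAX_QUEUE_SIZE
  Rails.logger.error("[checkup] #{enqueued} jobs enqueued; did we just ship an infinite loop?")
end
```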
They're not meant to catch every single error that happens, but they let us know if an error is happening frequently enough that it's gonna be a big problem. They give us some sense that something is beginning to degrade, even if that degradation is still within acceptable parameters. So I wanna give you another example, where a checkup totally saved my butt in real life, just to drive home how big a difference a good, simple checkup job can make. My company, Splitwise, makes an app that helps people share expenses with each other. And one of the very most important things that Splitwise does is calculate your balance with another person. So for example: you owe Ada $56. It is very important that we get this calculation right. And we have a bunch of tests to validate that everything adds up correctly in every possible edge case that you could think of. This is the bread and butter of what we do. But one random Tuesday, suddenly everything went wrong. All of a sudden, our code started returning two different answers for the same calculation. So when I asked how much I owe Ada, our Rails app might reply $56, but it also might reply $139. The result was completely random. And I mean literally random. It was like flipping a coin. You would randomly get one of these two possible answers back every time you called user.balance. This was obviously a huge user-facing problem. It is massively confusing, and seeing the wrong balance would destroy a user's faith in our app. This is literally our one job. Our one job is to keep track of your expenses for you. And if we can't do that, then why would you use Splitwise at all? Why are we here? And here's the kicker: we had not deployed anything new all day. In fact, we hadn't touched anything related to this balance calculation code in weeks and weeks. Nothing had changed at all. We had no reason to expect that anything should go wrong. And yet, it's 1 p.m. on a Tuesday, and our checkup goes off. We have a checkup that says: hey, we use caching in our balance layer, and it's really important to make sure that cached values are correct. So why don't we grab everyone who's been updated recently, and check that cached balance value against what the value would be if we recalculated it from scratch right now? By comparing these two values, we could continuously verify that our cached, optimized balance method was working as expected. And if anything went wrong, we could not only raise an alarm about it, we could clear the cache and get rid of the incorrect value.
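In sketch form, it looked something like this. (All the method names here are hypothetical stand-ins; the real version has more moving parts.)

```ruby
# Checkup: does the cached balance match a from-scratch recalculation?
# `alert` and the balance methods are hypothetical.
User.where("updated_at > ?", 10.minutes.ago).find_each do |user|
  cached = user.cached_balance       # hypothetical: the fast, cached value
  actual = user.recalculate_balance  # hypothetical: recompute from raw expenses

  if cached != actual
    alert("Balance mismatch for user #{user.id}: #{cached} vs #{actual}")
    user.clear_balance_cache!        # hypothetical: evict the bad cached value
  end
end
```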
And holy crap, that literally fixed the problem. Not only did this checkup alert us about the problem right away, it actually mitigated the problem in real time while we scrambled to figure out the cause and fix it over the next couple of hours. And in the end, something that could have affected thousands and thousands of users, sending us angry emails, affected zero. Nobody noticed, nobody contacted us. If you're curious about the actual underlying details here, this turned out to be a critical infrastructure problem with our third-party caching provider. They had an issue with a cluster, and we detected it so fast that we alerted them instead of them alerting us. In terms of days, it could have been the worst, but it turned out to be the best. Without this checkup, this day probably would have been one of the worst days that I've had as an engineer: a massive problem in production. But instead, it's a day that I'm really proud of. It's a day where we not only had the foresight to fix our own problems; we literally helped this third party get on the problem faster, and that helped other companies too. As an engineer, that's all I can ask for: building something that's stable and that not only helps us, but helps others. That was really cool. And so I wanna share a few final thoughts about checkups in order to wrap up here. First of all, I wanna be very explicit: this is a work in progress. Checkups are literally an idea that I made up, one that we use internally at Splitwise. And as I mentioned at the start, this is my first big public talk. This is my first time really trying to spread this idea outside of my own workplace. But I know for a fact that this is a common issue, because I've talked to friends at a bunch of different companies, and a lot of them have something like this. They have some kind of internal system that double-checks their production environment to make sure that certain things haven't exploded. The problem is, almost no one talks about those systems and those ideas in public. They're very siloed. And if you're a Rails developer building a new app, the way that you learn this stuff is mostly through trial and error and painful experience. It's not yet a part of our standard discussion about the things that happen when you're trying to build and deploy an app. And in part, I think it's because we don't have words for it yet. We don't have a preexisting vocabulary about how to double-check our production systems. And because we don't have that vocabulary, we don't have best practices yet either. We're not thinking about this problem in a communal way. We're not learning from each other yet. I have years of battle-tested experience, and if it weren't for this talk, I don't know how I would really share it right now. There are ways, but there's not an ongoing conversation about how to sniff out production-specific issues and how to make that a healthy part of your standard workflow. My hope is that the idea and the concept of a checkup can be somewhere for you to start. I think it's a good, intuitive framing for how to sniff out unexpected bugs in production. And if you think about your own apps through this lens, I think that you'll begin to see how checkups can help you build something that's more robust and more healthy. And I honestly believe that every app should have a checkup suite once it's big enough, just like you have a test suite. You definitely can deploy a successful app that doesn't have tests or doesn't have checkups, but you're leaving yourself blind to a lot of potential problems and headaches. So with that in mind, I know that building a whole checkup suite sounds pretty intimidating. So here's my suggestion for one specific, small place to start. ActiveRecord has this method called .valid?. It runs your validations, makes sure that they pass, and returns true or false. Well, you can take advantage of that really easily. You can write five lines of code that grab all the recently updated records in your database, then iterate through them and call .valid? on each one to make sure that the persisted data still passes validation. Again, this is literally five lines of code. It's a pretty easy place to start.
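Here's that idea as a sketch:

```ruby
# Sweep recently-updated records and re-run their validations.
# List whichever models matter most to your app.
[User, EmailAddress].each do |model|
  model.where("updated_at > ?", 1.hour.ago).find_each do |record|
    unless record.valid?
      errors = record.errors.full_messages.join(", ")
      Rails.logger.error("[checkup] #{model.name} #{record.id} is invalid: #{errors}")
    end
  end
end
```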
And if you run something like this on all of your ActiveRecord models, I'm confident that you will find some invalid records that managed to weasel their way into your database. You will be surprised at what you find, especially if you go back in time and run it on all of your historical records, not just what happened in the last hour. And finding problems is really the start of the battle. Once you find a problem, then you can start to fix it. Again, my name is Ryan Laughlin. I'm Rofreg on Twitter. All these slides are at rofreg.com/talks if you wanna refer back to them. I really care about this idea a lot, so I would really love to answer questions, and also just to talk about this in general during the conference. Come find me. That's about all I got.