 last session before happy hour. I appreciate all of you for hanging around this long. Maybe you're here because you don't know there's a bar on the first floor of this hotel. I think that's where the main track is currently taking place. I'm Austin Putman. I'm the VP of engineering for Omada Health. At Omada, we help people at risk of chronic disease, like diabetes, make crucial behavior changes and live longer, healthier lives. So it's pretty awesome.

I'm going to start with some spoilers, because I want you to have an amazing RailsConf. So if this is not what you're looking for, don't be shy about finding that bar track. We're going to spend some quality time with Capybara and Cucumber, whose flakiness is legendary for very good reasons. Let me take your temperature. Can I see hands? How many folks have had problems with random failures in Cucumber or Capybara? Yeah. Yeah, this is reality, folks. We're also going to cover the ways that RSpec does and does not help us track down test pollution. How many folks out there have had a random failure problem in their RSpec suite, like in your models or your controller tests? Okay, still a lot of people, right? It happens, but we don't talk about it. In between, we're going to review some problems that can dog any test suite: random data, time zone heck, external dependencies. All of this leads to pain. There was a great talk earlier about external dependencies. Here's just a random one: how many people here have had a test fail due to a daylight saving time issue? Yeah. Ben Franklin, you are a menace.

Let's talk about eliminating inconsistent failures in your tests. On our team, we call that fighting randos. And I'm here to talk about this because I was stupid and short-sighted, and random failures caused us a lot of pain. I chose to try to hit deadlines instead of focusing on build quality, and our team paid a terrible price. Anybody out there paying that price? Anybody out there feel me on this? Yeah, it sucks.

So let's do some science. Some projects seem to have more random failure problems than others, and I want to gather some data. First, if you write tests on a regular basis, raise your hand. Right? Wow, I love RailsConf. Keep your hand up if you believe you have experienced a random test failure. The whole room. Now, who thinks they're likely to have one in the next four weeks? It's still happening, right? You're in the middle of it. Okay, so this is not hypothetical for this audience. This is a widespread problem, but I don't see a lot of people talking about it. And the truth is, while it's a great tool, a comprehensive integration suite is a breeding ground for baffling heisenbugs.

So to understand how test failures become a chronic productivity blocker, I want to talk a little bit about testing culture. Why is this even bad? We have an automated CI machine that runs our full test suite every time a commit is pushed. And every time the build passes, we push the new code to a staging environment for acceptance. That's our process. How many people out there have a setup that's kind of like that? Okay, awesome. So a lot of people know what I'm talking about. In the fall of 2012, we started seeing occasional unreproducible failures of the test suite in Jenkins. And we were pushing to get features out the door for January 1st. And we found that we could just rerun the build and the failure would go away.
And we got pretty good at spotting the two or three tests where this happened. So we would check the output of a failed build, and if it was one of the suspect tests, we would just run the build again. Not a problem. Staging would deploy. We would continue our march toward the launch. But by the time spring rolled around, there were like seven or eight places causing problems regularly. And we would try to fix them. We wouldn't ignore them, but the failures were unreliable, so it was hard to say if we had actually fixed anything. Eventually we just added a gem called Cucumber Rerun. Yeah. This just reruns the failed specs if there's a problem, and when they pass the second time, it's good. You're fine. No big deal. And then some people on our team got ambitious, and they said, we could make it faster. We could make CI faster with the parallel_tests gem, which is awesome. But Cucumber Rerun and parallel_tests are not compatible. And so we had a test suite that ran three times faster but failed twice as often.

As we came into the fall, we had our first bad Jenkins week. On a fateful Tuesday, 4 p.m., the build just stopped passing. There were anywhere from 30 to 70 failures. Some of them were our usual suspects, and dozens of them were previously good tests, tests we trusted. And none of them failed in isolation, right? After like two days of working on this, we eventually got a clean RSpec build, but Cucumber would still fail. And the failures could not be reproduced on a dev machine, or even on the same CI machine outside of the whole build running. So over the weekend, somebody pushes a commit and we get a green build. And there's nothing special about this commit, right? It was like a comment change. We had tried a million things, and no single change obviously led to the passing build. And the next week we were back to like a 15% failure rate. Pretty good. So we could push stories to staging again, and we're still under the deadline pressure, right? So we shrugged and we moved on.

And maybe somebody wants to guess what happened next, right? Yeah, it happened again. A whole week of just no tests passing. The build never passes. So we turn off parallel tests, because we can't even get a coherent log of which tests are causing errors. And then we started commenting out the really problematic tests. And there were still these seemingly innocuous specs that fail regularly, but not consistently. These are tests that have enough business value that we are very reluctant to just delete them. So we reinstated Cucumber Rerun and its buddy, RSpec Rerun. And this mostly worked, right? So we were making progress. But the build issues continued to show up in the negative column in our retrospectives. And that was because there were several problems with this situation.

Like reduced trust. When build failures happen four or five times a day, those aren't a red flag. Those are just how things go. Everyone on the team knows that the most likely explanation is a random failure, and the default response to a build failure becomes "run it again." So just run it again, right? The build failed, whatever. So then occasionally we break things for real. But we stopped noticing, because we started expecting CI to be broken. Sometimes other pairs would pull the code and they would see the legitimate failures. Sometimes we thought we were having a bad Jenkins week.
And on the third or fourth day, we realized we were having actual failures. This is pretty bad, right? Our system depends on green builds to mark the code that can be deployed to staging and production. Without green builds, stories can't get delivered and reviewed, so we stopped getting timely feedback. Meanwhile, the reviewer gets a week's worth of stories all at once, in one big clump. And that means they have less time to pay attention to detail on each delivered feature. And that means the product is a little bit crappier every week. So maybe you need a bug fix, fast? Forget about that. You've got like a 20% chance your bug fix build is going to fail for no reason. Maybe the code has to ship because the app is mega busted. In this case, we would rerun the failed tests on our local machine and then cross our fingers and deploy. So in effect, our policy was: if the code works on my machine, it can be deployed to production. At the most extreme, people lose faith in the build, and eventually they just forget about testing. This didn't happen to us. But I had to explain to management that key features couldn't be shipped because of problems with the test server. And they wanted to know a lot more about the test server. And it was totally clear that while a working test server has their full support, an unreliable test server is a business liability and needs to be resolved. The test server is supposed to solve problems for us, and that is the only story that I like to tell about it.

So we began to fight back. We personified the random failures. They became randos: a rando attack, a rando storm, and most memorably, Rando Backstabbing Intergalactic Randomness Villain. We had a pair working on the test suite full time for about three months trying to resolve these issues. We tried about a thousand things, and some of them worked. I'm going to pass along the answers we found and the hypotheses that we didn't disprove. Honestly, I'm hoping that you came to this talk because you've had similar problems and you've found better solutions. So this is just what we found.

I have a very important tool for this section of the talk: the finger of blame. We used this a lot when we were like, hey, could the problem be Cucumber? And then we would go after that. So here comes the finger of blame. Cucumber, Capybara, Poltergeist: definitely part of the problem. I've talked to enough other teams that use these tools extensively, and we have the evidence from this audience, right, to know that the results are just not as deterministic as we want. When you're using multiple threads and you're asserting against a browser environment, you're going to have some issues. And one of those issues is the browser environment itself. "Browser environment" is a euphemism for a complicated piece of software that is itself a playground for network latency issues and rendering hiccups and callback soup. So your tests have to be written in a very specific way to prevent all the threads and all the different layers of code from getting confused and smashing into each other. Now, some of you maybe are lucky and you use the right style most of the time by default. Maybe you don't see that many problems.

A few things you've got to never assume. Never assume the page has loaded. Never assume the markup you are asserting against exists. Never assume your Ajax requests actually finished. Never assume the speed at which things happen, because until you bolt it down, you just don't know.
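To show what that style looks like in practice, here is a minimal sketch of a Cucumber step written this way. The form, selectors, and messages are made up for illustration, and it assumes a standard Cucumber plus Capybara setup with RSpec matchers; it is not a step from our suite.

```ruby
# Hypothetical step: selector and button names are illustrative only.
When(/^I save my profile$/) do
  # Never assume the markup exists: let Capybara wait for the form first.
  expect(page).to have_css("form#profile")

  fill_in "Zip code", with: "94103"
  click_button "Save"

  # Never assume the Ajax round trip finished: assert on its visible result,
  # and give slow operations longer than the default wait when they need it.
  Capybara.using_wait_time(20) do
    expect(page).to have_content("Profile updated")
  end
end
```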
So always make sure the markup exists before you assert against it. Newer Capybara is supposed to be better at this, and it's improved, but I do not trust them. I am super paranoid about this stuff. This is a good example of a lurking rando due to a race condition in your browser. Capybara is supposed to wait for the page to load before it continues after the visit method, but I find it has sort of medium success with that. Bolt it down. We used to have something called the wait_until block, which would stop execution until a condition was met. That was great, because it replaced the sleep statements we used before that. Modern Capybara has no more wait_until block; the waiting lives inside the has_css? and have_content matchers. So always assert that something exists before you try to do anything with it. Sometimes it might take a long time. The default timeout for those Capybara assertions is like five seconds, and sometimes you need 20 seconds. Usually for us that's because we're doing a file upload or another lengthy operation. But again, never assume that things are going to take a normal amount of time.

Race conditions. I would be out of line to give this talk without talking explicitly about race conditions. Whenever you create a situation where a sequence of key events doesn't happen in a predetermined order, you've got a potential race condition. The winner of the race is random, and that can create random outcomes in your test suite. So what's an example of one of those? Ajax. Your JavaScript running in Firefox may or may not complete its Ajax call and render the response before the test thread makes its assertions. Capybara tries to fix this by retrying the assertions, but that doesn't always work. Say you're clicking a button to submit a form, and then you're going to another page or refreshing the page. This might cut off that POST request, whether it's from a regular form or an Ajax form, but especially if it's an Ajax request: as soon as you say visit, all the outstanding Ajax requests in your browser are cancelled. You can fix this by adding an explicit wait into your Cucumber step. When you need to rig the race, jQuery provides this handy counter, $.active, which is the number of XHR requests that are still outstanding. So it's really not hard to keep an eye on what's going on.

Here's another offender: creating database objects from within the test thread. What's wrong with this approach? If you're using MySQL, maybe nothing, and that's because MySQL has the transaction hygiene of a roadside diner. There's no separation. If you're using Postgres, which we are, it has stricter rules about transactions, and this can create a world of pain. The test code and the Rails server are running in different threads. That effectively means different database connections, and that means different transaction states. Now, there is some shared database connection code out there, and I've had sort of mixed results with it. I've heard this thing, right, about shared mutable resources between threads being problematic? Yeah, they are. So let's say you're lucky and both threads are on the same database transaction. Both the test thread and the server thread are issuing checkpoints and rollbacks against the same connection. Sometimes one thread will reset to a checkpoint after the other thread has already rolled back the entire transaction. And that's how you get a rando.
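For reference, the shared database connection code floating around is a small monkeypatch that forces every thread onto one ActiveRecord connection. It looks roughly like this; treat it as a sketch of the idea rather than an endorsement, since, as I said, our results with it were mixed.

```ruby
# Widely circulated test-only patch: every thread shares one connection,
# so the server thread can see data created inside the test's transaction.
class ActiveRecord::Base
  mattr_accessor :shared_connection
  @@shared_connection = nil

  def self.connection
    @@shared_connection || retrieve_connection
  end
end

# Typically enabled only for the test environment, e.g. in spec_helper.rb or env.rb.
ActiveRecord::Base.shared_connection = ActiveRecord::Base.connection
```

Even when this works, both threads are still issuing checkpoints and rollbacks against one shared, mutable connection, which is exactly the situation that breeds randos.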
So you want to create some state in your application to run your test against, but you can't trust the test thread and the server thread to read the same database state. What do you do? In our project, we use a single set of fixture data. It's fixed at the beginning of the test run, and essentially the test thread treats the database as immutable. It is read-only, and any kind of verification of changes has to happen via the browser. We do this using Ryan Dy's fixture_builder gem to combine the maintainable characteristics of factory-built objects with the set-it-and-forget-it simplicity of fixtures. Any state that needs to exist across multiple tests is stored in a set of fixtures, and those are used throughout the test suite.

And this is great, except it's also terrible. Unfortunately, our fixture_builder definition file is like 900 lines long, and it's as dense as a master's thesis. It takes about two minutes to rebuild the fixture set, and this happens when we re-bundle, change the factories, or change the schema. Fortunately that only happens a couple of times a day, so mostly we're saving time with it. But seriously, two minutes of overhead to run one test is brutal. So at our stage, we think the right solution is to use fixture_builder sparingly. Use it for Cucumber tests, because they need an immutable database, and maybe use it for core shared models in RSpec. But whatever you do, do not create a DC Comics multiverse in your fixture setup file, with different versions of everything, because that leads to pain.

Another thing you want to do is mutex it. A key technique we've used to prevent database collisions is to put a mutex on access to the database. And this is crazy, but you know, an app running in the browser can make more than one connection to the server at once over Ajax, and that's a great place to breed race conditions. So unless you have a mutex to ensure the server only responds to one request at a time, you don't necessarily know the order in which things are going to happen, and that means you're going to get unreproducible failures. In effect, we use a mutex to rig the race. You can check it out on GitHub; it's just a sketch of the code we're using, at omadahealth/capybara_sync.

Faker! Some of the randomness in our test suite was due to inputs that we gave it. Our code depends on factories, and the factories use randomly generated fake data to fill in names, zip codes, all the text fields. And there are good reasons to use random data. It regularly exercises your edge cases. Engineers don't have to think of all the possible first names you could use. The code should work the same regardless of what zip code someone is in. But sometimes it doesn't. For example, did you know that Faker includes Guam and Puerto Rico in the states that it might generate for someone? We didn't include those in our states dropdown. So a Cucumber test edits the account for a user that Faker placed in Guam, their state is not entered when you try to click save, and that leads to a validation failure. And that leads to Cucumber not seeing the expected results. And a test run from a new factory will not reproduce that failure. Right? Something crazy happened.
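One way to tame this class of rando is to keep Faker for the fields where any value should work, and to constrain the fields that feed validations or dropdowns to values the app actually supports. Here is a hypothetical sketch, not our actual factory file; the factory name, fields, and state list are made up for illustration.

```ruby
# Hypothetical factory: names and zip codes stay random, but the state is
# sampled from the same list that backs the states dropdown, so Faker can
# never place a user in Guam or Puerto Rico and break the save.
SUPPORTED_STATES = %w[AL AK AZ CA CO NY TX WA].freeze # illustrative subset

FactoryGirl.define do
  factory :account do
    first_name { Faker::Name.first_name }
    zip_code   { Faker::Address.zip_code }
    state      { SUPPORTED_STATES.sample }
  end
end
```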
Here we go: times and dates. Another subtle input to your code is the current time. Our app sets itself up to be in the user's time zone, so that time-dependent data, like which week of our program you are on, doesn't shift out from under you in the middle of Saturday night. And this was policy; we all knew about this. We always used zone-aware time calls. Except that we didn't. When I audited, I found over 100 places where we neglected to use zone-aware time calls. Most of these are fine. There's usually nothing wrong with epoch seconds. But it only takes one misplaced call to Time.now to create a failure. It's really best to just forget about Time.now. Search your code base for it and eliminate it. Always use Time.zone.now. Same thing for Date.today: that's time zone dependent, so you want to use Time.zone.today. Unsurprisingly, I found a bunch of this class of failure when I was at RubyConf in Miami. These methods create random failures because your database objects can be in a different time zone than your machine's local time zone.

External dependencies. Any time you depend on a third-party service in your tests, you introduce a possible random element. S3, Google Analytics, Facebook: any of these things can go down. They can be slow. They can be broken. Additionally, they all depend on the quality of your local internet connection.

So I'm going to suggest that if you are affected by random failures, it's important to reproduce the failure. It is possible. It is not only possible, it is critical. Any problem that you can reproduce reliably can be solved. Well, at least if you can reproduce it, you have a heck of a lot better chance of solving it. So you have to bolt it all down.

How do you fix the data? When you're trying to reproduce a random failure, you're going to need the same database objects used by the failing tests. If you use factories, there's no file system record, so when a test starts to fail randomly, you're going to want to document the database state at the time of failure. That's going to mean YAML fixtures, or an SQL dump, or something else clever. You have to find a way to re-establish the same state that existed at the moment you had the failure.

And the network. There was a great talk before this about how to nail down the network. API calls and responses are input for your code; WebMock, VCR, and other libraries exist to replay third-party service responses. So if you're trying to reproduce a failure in a test that has any third-party dependencies, you're going to want to use a library to capture and replay those responses. Also, share buttons: in your Cucumber tests, you're going to want to remove the calls to Google Analytics, Facebook like buttons, all that stuff from the browser. These slow down your page load time, and they create unnecessary failures because of that.

But if you're replaying all your network calls, how do you know the external API hasn't changed? You want to test the services that your code depends on, too. So you need a build that does that. But it shouldn't be the main build. The purpose of the main build is to let the team know when their code is broken, and it should do that as quickly as possible. And then we have a separate external build that tests the interactions with third-party services. So essentially, external communication is off in one and on in the other, and we check build results for both.
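A minimal sketch of what that on/off switch can look like with WebMock follows; the EXTERNAL_BUILD environment variable and the file placement are assumptions for illustration, not necessarily how our build is wired.

```ruby
# spec_helper.rb (sketch): the main build stubs out the outside world, and a
# separate CI job sets EXTERNAL_BUILD=1 to exercise the real integrations.
require "webmock/rspec"

if ENV["EXTERNAL_BUILD"]
  # External build: let requests through so we notice when a third-party API changes.
  WebMock.allow_net_connect!
else
  # Main build: no real network traffic, only localhost for the Capybara server.
  WebMock.disable_net_connect!(allow_localhost: true)
end
```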
So I want to talk about another reason tests fail randomly. RSpec runs all your tests in a random order every time. Obviously this introduces randomness, but there is a reason for that. And the reason is to help you stay on top of test pollution. Test pollution is when state that is changed in one test persists and influences the results of other tests. Changed state can live in process memory, in a database, on the file system, in an external service. Lots of places. Sometimes the polluted state causes a subsequent test to fail incorrectly, and sometimes it causes a subsequent test to pass incorrectly. This was such a rampant issue in the early days of RSpec that the RSpec team made running the tests in a random order the default as of RSpec 2. So thank you, RSpec. Now any test pollution issue should stand out. But what do you think happens if you ignore random test failures for like a year or so? Yeah.

Here are some clues that your issue might be test pollution. With test pollution, the affected tests never fail when they're run in isolation. Not ever. Rather than throwing an unexpected exception, a test pollution failure usually takes the form of returning different data than what you expected. And finally, the biggest clue that you might have a test pollution issue is that you haven't really been checking for test pollution.

So we've got to reproduce test pollution issues, which means we have to run the test suite in the same order, with the fixture data and the network data from the failed build. First you have to identify the random seed. Maybe you've seen this cryptic line at the end of your RSpec test output. It is not completely meaningless: 22164 is your magic key to rerun the tests in the same order as the build that just ran. So you want to modify your .rspec file to include the seed value. Be sure to change the format to documentation as well as adding the seed. That will make the output more readable, so you can start to think about the order things are running in and what could possibly be causing your pollution problem. There's a concrete sketch of this just after this section.

The problem of test pollution is fundamentally about incorrectly persisted state, so the state that's being persisted is important. You want to ensure that the data is identical to the failed build, and there are lots of ways to do this. So you've got your random seed, you've got your data from the failed build, and then you rerun the specs. If you see the failure repeated, you should celebrate. You've correctly diagnosed that the issue is test pollution, and you are on your way to fixing it. If you don't see the failure, maybe it's not test pollution. Maybe there's another aspect of your build environment that needs to be duplicated.

But even then, say you've reproduced the problem. Now what? You still have to diagnose what is causing the pollution. You know that running the tests in a particular order creates a failure. The problem with test pollution is that there is a non-obvious connection between where the problem appears, in the failing test, and its source, in another test case. You can investigate the failure using print statements or a debugger, using whatever tools you want. Maybe you get lucky and you are able to just figure it out. But in a complex code base with thousands of tests, the source of the pollution can be tricky to track down. And just running through the suite to reproduce the failure might take 10 minutes. This is actually terrible, right? Waiting 10 minutes for feedback is a source of cognitive depletion. All of the stack you've built up in your brain to solve this problem is disintegrating over those 10 minutes.
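Here is that concrete sketch of the replay setup. Using the seed value from the example output above, your .rspec file might contain:

```
--seed 22164
--format documentation
```

The same thing works from the command line, for example `rspec --seed 22164 --format documentation`, if you'd rather not commit the change.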
While those tests run, you're going to work on other problems. You're going to check Facebook. You're going to lose your focus, right? And that is essentially how Rando wins. Fortunately, we can discard large amounts of complexity and noise by using a stupid process that we don't have to think about: binary search. In code, debugging via binary search is a process of repeatedly dividing the search space in half until you locate the smallest coherent unit that exhibits the behavior you're looking for.

So we have the output of a set of specs that we ran in documentation mode. This is sort of a high-level overview that you might see in Sublime, and in the middle here, this red spot is where the failure occurs. We know the cause has to happen before the failure, because causality. The green block at the top is the candidate block, the search space. Practically, we split the search space in half and remove half of it. If the failure reoccurs when we rerun with this configuration, we know that the cause is in the remaining block. But sometimes you've got more problems than you know, so it's good to test the other half of the search space as well. If your failure appeared in step zero, you expect not to see the failure here. If you also see the failure here, you might have multiple sources of test pollution. Or, more likely, test pollution isn't really your problem, and the problem is actually outside of the search space.

So here's a hiccup. Binary search requires us to remove large segments of the test suite to narrow in on the test that causes the pollution. And this creates a problem, because random ordering in the test suite changes when you remove tests. Remove one test, and the whole thing reshuffles, even on the same seed. So there's no way to effectively perform a binary search using a random seed. Here's the good news: it is possible to manually declare the ordering of your RSpec tests using an undocumented configuration option, order_examples. config.order_examples takes a block, and that block gets the whole collection of RSpec examples after RSpec has loaded the specs to be run. You just reorder the examples in whatever order you want and return that set from the block. That sounds simple, so I made a little proto-gem for this. It's called RSpec Manual Order, and basically it takes the output of the documentation format from the run you did earlier and turns that into an ordering list. So if you log the output of your RSpec suite with the failure to a file, you'll be able to replay it using RSpec Manual Order; you can check that out on GitHub. So it's possible to reduce the search space and do a binary search on RSpec.

And once you've reduced the search space to a single spec, or a small set of examples that all cause the problem, you put your monkey brain in a position to shine against your test pollution issue. This is where it actually becomes possible to figure it out by looking at the context. I've gone in depth into test pollution because it's amenable to investigation using simple techniques. Binary search and reproducing the failure state are key debugging skills that you will improve with practice. When I started looking into our random failures, I didn't know we had test pollution issues. It turned out we weren't resetting the global time zone correctly between tests.
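As a sketch of the general shape of that kind of fix, not necessarily our exact code: an RSpec hook that restores the global zone around every example, so a test that changes Time.zone can't leak it into the next one.

```ruby
# spec_helper.rb (sketch): restore the global time zone around every example.
RSpec.configure do |config|
  config.around(:each) do |example|
    original_zone = Time.zone
    example.run
    Time.zone = original_zone
  end
end
```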
This was far from the only problem I found, but without fixing this one, our suite would never be clean. Every random failure that you are chasing has its own unique story. There are some in our code that we haven't figured out yet, and there are some in your code that I hope I never see. The key to eliminating random test failures is: don't give up. Today we've covered things that go wrong in Cucumber and Capybara, things that go wrong in RSpec, and just general sources of randomness in your test suite. Hopefully you're walking out of here with at least one new technique to improve the reliability of your tests. We've been working on ours for about eight months, and we're in a place where random failures occur less than 5% of the time. We set up a tiered build system to run the tests sequentially when the fast parallel build fails. And the important thing is that when new random failures occur, we reliably assign a team to hunt them down. If you keep working on your build, eventually you'll figure out a combination of tactics that will lead to a stable, reliable test suite that has the trust of your team. So, thank you.