Welcome, and thank you for being here. I recently revisited one of the most sophisticated applications I've ever been a part of. A key feature of this application took a complex input — in this case, a huge production database connected to a CMS — and fed all that data through a Ruby function consisting of hundreds of lines of raw Postgres. It took all that data and spit it out into a CSV file between five and ten thousand lines long, representing every possible permutation of the data, and so way too long to be read or reasonably understood by a human. This was a complex feature in a complex application. Now, given that I was one of the original authors of this feature, you might think it was easy for me to jump right back in and start working on it again. But that was not my initial experience, and in support of this I'll reference something known as Eagleson's Law: any code of your own that you haven't looked at for six or more months might as well have been written by somebody else. My point is that old code is challenging, and it doesn't matter if you wrote it or somebody else did. We call this legacy code, and many of us deal with it every day. For those of us who haven't yet experienced it, let me walk through some of the hallmarks of a legacy codebase. First, you'll see old interfaces. In this application we were dealing with aging versions of Ruby and Rails. If you open up the gemset, you're going to see gems that have been forked, gems that are no longer being maintained, and gems that have no path to upgrade — and each of these presents its own unique obstacle to future development. You'll see vulnerabilities: because you're not updating the software, you're not getting security patches, and your code is more exposed. And you'll see dead code: poor cleanup over time leads to code that never, ever gets called. There's the saying "you aren't gonna need it" — YAGNI. Well, that wasn't followed; waterfall was. We have code that relies on abandoned design patterns and things we now consider anti-patterns. So those are some of the downsides of legacy code, but there are benefits too, and I would list them as profit and users. Starting with profit: a legacy application is very often an application that is making somebody money. If you're a developer being paid to work on a legacy codebase, the fact that you're there strongly suggests somebody is making money. And you hopefully have users — users who care, who have invested in your product, and who have a contract expecting the application to do a certain thing time and time again. That contract is very, very special. One thing I know for sure is that legacy code is inevitable, and I mean this in two ways. First, it's inevitable for you: if you're not working on a legacy codebase now and you stay in Ruby on Rails, legacy code will be part of your career. It's also inevitable for the applications themselves. I believe no application is ever truly feature complete; we will always want to develop and add features, and progress is going to continue. When that happens, we hopefully have tests. In the application I was working on, we luckily did, with coverage and a design that still made sense to us a year down the road. But what happens if we don't? Well, now we're talking about something a lot worse: untested legacy code.
If you're going to continue developing a Ruby on Rails app that doesn't have tests, you're going to have to retroactively find a way to add them — or you're going to break features and negate that contract between the application and the user. There are three types of tests most of us are familiar with: unit tests, API tests, and user interface tests. Unit tests test individual functions and classes. API tests test the way different parts of the application talk to each other and to external parties. And user interface tests — also known as integration tests or feature tests — test behavior from a high level. So if we needed to start from scratch testing an untested legacy codebase, are these three types of tests enough? Well, they each have their own shortcomings. For unit and API tests, there could be thousands of functions and endpoints, and you have to know which ones are actually being used, or you're going to waste your time writing a lot of tests. For user interface tests, we have to know what a user actually does on the site. Figuring out which types of tests to write, and in what order, is hard and pretty subjective, and each type of test has its own unique blind spots. So I have a metaphor I'd like to introduce now, one I'll be using throughout my talk, and that is taking a watermelon and throwing it into a fan — a big fan that can chop up the watermelon and splatter it onto a wall. Let me explain. We start with a complex input: that is the watermelon in the metaphor. In this case, that's the large production database connected to the CMS. We take the watermelon and throw it as fast as we can into the fan, and the fan chops it up. Following the metaphor, the fan is that Ruby function with all the SQL, and it splatters the watermelon onto the wall — our complicated output, the 5,000-plus-line CSV file. Now, there's an interesting property of this kind of system: changes to the fan are really hard to detect. If I throw a watermelon through the fan today, throw another one through tomorrow, and examine the two splatters on the wall, it's very difficult to tell whether the fan changed at all. But detecting changes to the fan is the only thing the stakeholders really care about. They want to know that we can guarantee consistent output time and time again. Which leads me to a question: are any of the traditional tests — unit tests, API tests, and user interface tests — really equipped to cover this feature? The closest is a unit test, but the isolation of a test database is never going to come close to the complexity of the production database, the watermelon. And yet we have to test. The reason we must test is that we want to keep adding features while we also must preserve behavior. This is a problem. So what are we to do? I have a solution, and it's something I've built in production, called the gold master test. My name is Jake Worth and I'm a developer at Hash Rocket in Chicago. This talk will be 38 minutes total and 61 slides, and it is not a Rails talk — it's a general programming talk. Here's my agenda: I'll start off by defining a gold master test, then I'll talk about writing the test, and finally I'll discuss working with the test. Part one: defining the gold master test.
So to define this test, I want to talk about a test that is similar to a gold master test, and then use that definition to hammer out the definition of the gold master test. The seeds of this idea come from a book published in 2004 called Working Effectively with Legacy Code, by Michael C. Feathers. In the preface, Feathers defines legacy code as code without tests — which perfectly fits our working definition of untested legacy code. He sums up a key idea in the book with the following quote: in nearly every legacy system, what the system does is more important than what it's supposed to do. So the behavior of the legacy system isn't right or wrong; those terms don't really have meaning here. It simply is what it is, it does what it does, and that is a contract with the user. This comes from a sub-chapter in the book about a type of test called a characterization test, and here's that definition: a characterization test is a test that characterizes the actual behavior of the code. Once again, it isn't right or wrong. It simply does what it does, and that is the contract the users have come to expect. In order to write a test like this, Feathers introduces a process — an unnamed process, which I'm calling the characterization test process. Here it is. Step one: use a piece of code in a test harness. Step two: write an assertion that you know will fail. Step three: let the failure tell you what the behavior is. And step four: change the test so that it expects the behavior the code produces. Here's what such a test might look like. We start off by saying expect something to equal two. We run the test once and it fails: it says expected two, got one. Then we change the test to say expect something to equal one. Run it again, and it passes.
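To make that concrete, here's a minimal sketch of the process in RSpec — `LegacyReport.generate` is a hypothetical stand-in for whatever untested code you're characterizing:

```ruby
# Step one: get the code into a test harness.
# Step two: write an assertion we know will fail.
RSpec.describe LegacyReport, '.generate' do
  it 'characterizes the current behavior' do
    expect(LegacyReport.generate).to eq(2)
    # Fails with: expected: 2, got: 1
  end
end

# Step three: the failure tells us what the behavior actually is.
# Step four: change the test to expect the behavior the code produces.
RSpec.describe LegacyReport, '.generate' do
  it 'characterizes the current behavior' do
    expect(LegacyReport.generate).to eq(1) # passes
  end
end
```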
Has anyone ever written a test this way? Okay. In any other context, this is a very lazy way to write a test, because you're avoiding all the upfront work of figuring out what the code does. But if you accept the premise that all that matters is what the legacy code does, then this actually makes perfect sense. Feathers goes further with a heuristic explaining when such a test is applicable, which I've abridged to fit on the slide. Here it is, the heuristic for writing characterization tests. Step one: write tests where you will make changes; write as many cases as you feel you need. Step two: look at the specific things you're going to change and attempt to write tests for those. And step three: write additional tests on a case-by-case basis. So if that's a characterization test, here's how it differs from a gold master test. The characterization test focuses on the areas where you will make changes, as you can see in the first bullet. It cares about everything at the micro level — the specific things you're going to change, from the second bullet. And Feathers says these are not black-box tests — a black-box test being one where you make assertions about the dimensions of the black box, but you can't open it and see what's inside. This is all the opposite of a gold master test. A gold master test focuses on the entire application as a whole. It cares only about the macro level, not the micro level. And it is intentionally ignorant of what's happening inside the black box. With all that in mind, here is my definition of a gold master test. A gold master test is a regression test for complex, untested systems that asserts a consistent macro-level behavior. The image on the right is one of the Voyager Golden Records, launched into space in 1977 to share with the galaxy the sounds we had produced on Earth up to that point. So let me break my definition down a little here. A regression test: we have features that we like and we don't want them to go away, so we write a test that tries to prevent that. It is for complex, untested systems, and it asserts a consistent macro-level behavior. Consistent macro-level behavior, for me, means the application works in the broadest possible sense, and we want it to continue working in the broadest possible sense. As I said, this definition is mine, and that's because, like a lot of ideas in software, this one seems to have come from many different places at once, and it's hard to find one canonical definition — but this is what I'm going with. So let's look at a sample workflow of a gold master test. The first run is kind of boring. In step one, we restore a production database into our test database: that's the watermelon. In step two, we run the gold master test, which chops up all the data, like the fan. In step three, we capture the output — the splatter on the wall. And in step four, we ignore the output. That's all that happens on the first run; you've basically set up the artifact you'll need for the subsequent run, which is more interesting. On the subsequent run, we do the same thing: we restore the production database into the test database, run the gold master test again, and capture the output. But then we compare that output to the previous output, and if there's a failure, we get down to step five: we either change the code, or we commit the new gold master as the new standard we're holding everything to. Failure is going to prompt some sort of decision, and if you've written the test correctly, you shouldn't be able to bypass that failure — you have to decide what to do. The ideal application for a gold master test has three things in common: it is mature, it is complex, and we expect minimal change to the output. Mature: there's behavior in there that we think is important, but it's not covered sufficiently by a test. Complex: complex enough that a few unit tests with an integration test on top are not going to be sufficient. And minimal change to the output: there's a contract established with the user that we want to persist. There are some benefits to adding a test like this to your codebase, and here they are. First, we get rigorous development standards. This is a very high bar for the developers on your team — you're basically saying nothing in the entire application, or a giant wing of the application, should change in any way. And if you're running tests regularly, you can add this to your testing cycle: the tests are green, green, green, and suddenly they're red, and you realize you've changed something. Second, it will expose surprising behavior. If you have code that is non-deterministic, or that returns a different result based on the operating system it's running on, a gold master test is going to catch that much more quickly, I would argue, than a unit test or an integration test, because of how granular the comparison actually is. And third, it's useful for forensic analysis. Because we have a test that covers the whole application, if something breaks, we can go back through time using a tool like git bisect and figure out exactly when it broke.
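As a sketch of what that might look like — assuming the gold master spec lives at spec/gold_master_spec.rb and that v1.0 is a tag you know was good, both hypothetical names:

```sh
# Let git bisect walk the history, using the gold master test
# to judge each revision automatically.
git bisect start HEAD v1.0    # HEAD is bad, v1.0 was good
git bisect run bundle exec rspec spec/gold_master_spec.rb
git bisect reset              # return to where you started
```

git bisect run treats a zero exit status as a good revision and a non-zero one as bad, which is exactly how RSpec reports a failing suite.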
So once again, here's my definition of a gold master test: a gold master test is a regression test for complex, untested systems that asserts a consistent macro-level behavior. Now that we have a working definition of a gold master test, on to writing one. Part two: writing the test. We're going to be looking at a little code now, in Ruby, RSpec, and Postgres. But quickly, back to the feature. We have a feature with a large production database, fed through a complex Postgres function, and output into a large CSV file — and that makes this pretty much the ideal application for a gold master test. When I write a test like this, I like to break it into three phases: preparation, testing, and evaluation. Starting with our preparation: we have to build that watermelon. The way this works is we acquire a dump from production. You get that from your production database server and pull it down to your local machine. The very first step is to scrub the database of sensitive information. You want to get rid of email addresses, IP addresses, sessions, encrypted information, financial information. This is really, really important, because at the end of this step we're going to check some or all of that database into version control. If you don't scrub the data, it's a vulnerability. Very, very important. The way I recommend doing that is to use a local utility database that you can dump the data into, and then run a scrub script against it, which makes the process very repeatable — because this is something you will have to do more than once. Once we have that scrubbed data, we need to dump it out as plain-text SQL, and on our team we wrote a small rake task just for that export. Here it is; it's called create_db. We name a destination file, gold_master.sql, and then we shell out using the pg_dump utility. We point it at our sanitized utility database — that's where the scrubbed production data is currently stored — pass it some Postgres flags, and send the output to our destination. Notice the Postgres flags are a bit of a hand wave, sitting inside those two angle brackets there. That's because dumping a production database into a test database takes a small amount of data massaging; they're not the same environment, and the details will differ based on your application. But I'll post a link to a GitHub branch at the end of this that shows an example of some flags we found useful.
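Here's a sketch of what that rake task might look like — the task layout and database name are assumptions, and the flags shown merely stand in for the hand-waved ones on the slide:

```ruby
# lib/tasks/gold_master.rake — a sketch, not the exact task.
namespace :gold_master do
  desc 'Export the scrubbed utility database as plain-text SQL'
  task :create_db do
    destination = 'spec/fixtures/gold_master.sql'
    # --no-owner and --no-acl are examples of the kind of massaging
    # flags you may need; the right set depends on your application.
    sh "pg_dump sanitized_production --no-owner --no-acl > #{destination}"
  end
end
```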
When this is done, we check that plain-text SQL into version control, and this is the artifact that starts off the test. Next we move from preparation into our test phase. I like to start with an empty test file any time I'm writing a test, just to validate the setup. Here's what that empty test file looks like: I'm describing a class called Fan and a function called shred, and I'm saying that it produces a consistent result. This is how I would start, and it should pass. Stepping into the test, the first thing we have to do is take that production dump and load it into the test database. Here's a way you could do that: we use ApplicationRecord.connection.execute and pass it a heredoc. We start off by truncating the schema_migrations table — we found that to be pretty much a guaranteed conflict any time we tried this, so we just empty out that table — and then we read our gold_master.sql into the test database. The result is a test database that is full of your production data, and this is a richer testing environment than most of us have probably ever gotten to use before. The next thing we do is perform the transformation — the fan of the metaphor. We call the function, Fan.shred, and assign its return value to a variable called actual. Now, Fan.shred is written in such a way that it returns something meaningful, something we can make an assertion about; in this case, it returns that CSV output. That's something you'll have to decide if you're writing such a test. We assign it to a variable called actual, and that is what we're going to make our assertions about. Here's my strategy for making those assertions. The test can do two things: on the first run, it can generate the gold master, and on subsequent runs, it can compare the current result to the gold master. This is a literal interpretation of that flowchart I showed earlier, where the test can do both things — make the gold master and then compare against it. I like that, because I want the developers who use the test to be able to run it every time without needing prior knowledge of what a gold master test is. I think it also makes it easier to regenerate the gold master over time, if you ever lose it or decide to change it. That's a decision we made: I would rather the test always run than require a bunch of information to be absorbed first. So now that we have our actual variable, it's time to start making some assertions about it. We name a file called gold_master.txt; that is going to be the location of the present and future gold masters. The first thing we do is check whether it exists. If it does not exist, we write our actual to that file. This works on the first pass because the write returns truthy — it's kind of a no-op, in a way; it just made a file for us to use. And that is the end of the first run: the test passes, and all is well. The second pass, again, is where things get a little more interesting. Our if is no longer true, so we move into our else: the gold master file exists, so we read it, and then we compare the gold master to actual. If the gold master does not match actual, we write actual to the gold master file — and if you've checked in that gold master, this adds unstaged changes to your version control, which you'll have to deal with. That's a deliberate decision I'll talk about in a second. Finally, we make the assertion: if actual does not equal the gold master, the test fails — and it will fail pretty loudly, depending on how you've written it. This is the entire test file, for people watching on the internet in the future. It is 19 lines.
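Here's a sketch of that file, assembled from the steps above — Fan.shred and the fixture paths are stand-ins for your own code:

```ruby
# spec/gold_master_spec.rb — a sketch of the 19-line test.
require 'rails_helper'

describe Fan, '.shred' do
  it 'produces a consistent result' do
    # Load the scrubbed production dump into the test database.
    ApplicationRecord.connection.execute(<<~SQL)
      TRUNCATE schema_migrations;
      #{File.read('spec/fixtures/gold_master.sql')}
    SQL

    actual = Fan.shred # the fan: returns the CSV output

    gold_master_path = 'spec/fixtures/gold_master.txt'
    if !File.exist?(gold_master_path)
      File.write(gold_master_path, actual) # first run: generate and pass
    else
      gold_master = File.read(gold_master_path)
      # On a mismatch, overwrite the gold master so version control
      # flags the change, then fail the assertion loudly.
      File.write(gold_master_path, actual) if actual != gold_master
      expect(actual).to eq(gold_master)
    end
  end
end
```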
Next we move into our evaluation phase. On a stable system, this test should just be passing and passing and passing. If it fails, that is an alarm that you have broken the contract with the user — and like any good regression test, it's simply trying to prevent that type of thing from happening. If the test does fail, the checked-in gold master is going to be flagged by your version control: you have changed that file. So you're going to have to make a decision. What happens then? Well, here's a flowchart to explain. We start off with a test failure. We look at the failure and ask: is it a valid change? Is it a desired change? If it is a valid change, we check in the new gold master and continue on our way. If it is not a valid change, we need to pause and reevaluate. We need to grab a wrench, open up the fan, and figure out what it is we broke — because we have broken the contract with the user. So now that we know what it's like to write such a test, what is it like to work in a codebase that has one? Part three: working with the test. I'd like to look at a sample workflow for a developer who has this test in their suite — and that developer is me — and then explore some advanced applications. But first, a real-world example. This is Today I Learned, available at til.hashrocket.com. It's an open source project that I help maintain at Hash Rocket, and it allows my coworkers to publish posts of 200 words or less about things we're working on every day, with code samples written in Markdown. The feed of the site always lists the newest post on top, which gives it that constantly refreshing feel of a social media site — and that incentivizes people on my team to keep generating new content, because the top is the prime position to be in. So let's say I wanted to write a gold master test for Today I Learned. First off, does that even make sense? Well, here's the checklist for an ideal application. It needs to be mature: Today I Learned is over two years old, which in the world of web development isn't really new anymore, and I feel that's close enough to mature for our purposes. It needs to be complex: beneath the veneer of Today I Learned — basically one page that anyone looks at — there's a pretty complicated user interface that lets people write posts in a way we like a lot. And beneath the somewhat simple web application you have Rails, you have Ruby, you have the entire technology stack, and that is a very complicated system. So we'll say that yes, Today I Learned is complex. And finally, we expect minimal change to the output. This is very true for me. As I said, nothing is ever truly feature complete, but people have come to the site for a couple of years, my coworkers use it on a daily basis, and they expect it to do certain things; I would be upset if that were to break. So here's my assertion: the homepage, given the same data, should not change without us knowing why. And if you check out Hash Rocket's GitHub under the hr-til repository, there's a branch called gold master demo where the test I'm about to show is available — so there's an example test in the repo to look at. If this is my assertion — the homepage shouldn't change without us knowing why — how would I go about writing a test that does that? Well, first we have to prepare: get the production database dump, scrub it of sensitive information, dump it as plain-text SQL, and check in the SQL. This is the scrub script I wrote for this test. It's called sanitized_production.sql, and it touches three tables in the database: developers, sessions, and posts. Developers are obviously our users, so I go through username, email, Twitter handle, admin, and Slack name, and set them to somewhat innocuous values that are still unique for each developer. There's nothing in that table that's really sensitive, but I feel it's best practice to scrub as much as you can — in a client project, this table probably would hold sensitive information. We use a gem to manage our sessions; that data is both irrelevant to the test and not something I want to check into an open source project, so I delete everything from that table. And finally, I delete from posts where the ID is greater than 200, and this is just data massaging. We have about 1,300 posts, and I don't want to dump them all into the test database — I only want 200. That is a compromise I'm making: I'm choosing a faster test, and I'm giving up a perfectly accurate representation of the production database. But I know this application very well, and nothing in it particularly cares whether there are more than 50 posts, which is the pagination breakpoint on paginated pages. So as far as Today I Learned is concerned, 200 posts is about the same as 1,300 posts.
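Here's a sketch of that script — the exact column names are my guesses from the description, so treat them as illustrative:

```sql
-- sanitized_production.sql — a sketch; column names are assumptions.
-- Innocuous but still-unique values for each developer.
UPDATE developers
SET username       = 'developer_' || id,
    email          = 'developer_' || id || '@example.com',
    twitter_handle = 'developer_' || id,
    slack_name     = 'developer_' || id,
    admin          = false;

-- Session data is irrelevant to the test and shouldn't land in
-- an open source repo.
DELETE FROM sessions;

-- Data massaging: 200 posts behave the same as 1,300 here.
DELETE FROM posts WHERE id > 200;
```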
So to write this test, we're going to take almost the exact same path as the test I showed before, except the thing we capture has to be a little bit different. We start off restoring the data — you don't have to read this, because it's similar to my previous slide. We dump that production data into the test database, so we have the watermelon. Next, we visit the root path. We use the Capybara visit method to visit the root path, which is aliased to the posts path, and following Rails convention that is the posts controller's index action. This kicks off a very complex chain of events that ends with the browser having Today I Learned available to look at. Once we've done that, we need to make an assertion about what comes back. We use page.html, another Capybara method, to assign the entire HTML of the page to a variable called page_html. Once this is done, we work through a similar kind of conditional: on the first test run, we generate the gold master, and on subsequent runs, we compare against previous runs. Here is that test file — once again, it's on GitHub under the hr-til repo.
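And here's a sketch of it, reconstructed from those steps — the real version on the branch may differ in its details:

```ruby
# spec/features/gold_master_spec.rb — a sketch of the hr-til test.
require 'rails_helper'

describe 'the home page', type: :feature do
  it 'does not change without us knowing why' do
    # Restore the scrubbed production dump (the watermelon).
    ApplicationRecord.connection.execute(<<~SQL)
      TRUNCATE schema_migrations;
      #{File.read('spec/fixtures/gold_master.sql')}
    SQL

    visit root_path       # Capybara; aliased to posts#index
    page_html = page.html # the entire rendered HTML

    gold_master_path = 'spec/fixtures/gold_master.txt'
    if !File.exist?(gold_master_path)
      File.write(gold_master_path, page_html) # first run: generate
    else
      gold_master = File.read(gold_master_path)
      File.write(gold_master_path, page_html) if page_html != gold_master
      expect(page_html).to eq(gold_master)    # subsequent runs: compare
    end
  end
end
```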
If any HTML on the page changes at all, in any way, this test is going to fail. I recently worked on a project where I had to come back and work underneath a test suite that had a test like this in it — one I had written — and to my surprise, it was actually a really great experience. The test caught things when it should have caught them and let me develop when I wanted to develop. That was very reassuring. So let's take a look at an example workflow of a developer working with this type of test. This is a video of myself working in tmux a couple of weeks ago — it's just like live coding, except there's no way I can make a mistake. Okay, a quick orientation. In the upper left, we have the terminal. In the lower left, I'm running the watch command over ls -al, which shows all of the files in the spec/fixtures directory, so every two seconds that pane updates if there's a new file. Right now the only thing in there is gold_master.sql, but we'd expect a gold_master.txt to appear after the first test run, when it becomes the gold master. And on the right, we have the test. The first thing we do is run the test, and if this works, it should pass — because the write returns truthy — and it will put a new file in that directory called gold_master.txt. Okay, there it is; it passed on the first run. And if we run it again and again — if our code is deterministic on any machine we run it on — it should continue to pass, and that gives us that nice, virtuous development cycle we get when working in a test harness. So the test passes on the second run, and gold_master.txt does not change, because it doesn't need to. At this point, we want to check in the gold master. This is an artifact we'll be using for every subsequent test run, so we check in that file with a nice short commit message. Okay. Now I want to try to change behavior in a way that will cause this test to break, and there are lots of ways we could do that: we could change the template, change some of the data, change many different things. But for me, the interesting thing to change is the controller action. This is the posts controller's index action — the thing that actually creates the homepage. If you look on line 25, we assign an instance variable, @posts. We get our posts with some associations eager loaded, we limit them with a scope called published, and then on line 28 we order the posts by published_at. That is the thing that puts the newest posts at the top of the page. If we were to change that, the gold master test should definitely break, because all those posts would now be in a different order — the titles, the developer names — there's no way this should pass a gold master test.
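Here's a condensed sketch of that action — the eager-loaded association is my guess at the real code:

```ruby
# app/controllers/posts_controller.rb (condensed sketch)
def index
  @posts = Post.includes(:developer)       # line 25: assign @posts, eager loaded
               .published                  # scope: published posts only
               .order(published_at: :desc) # line 28: newest posts on top
end
```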
So let's imagine an enterprising young developer comes onto the team and says: I think we should order these by likes. The most-liked post is much more interesting than the newest post. Well, this should cause the gold master test to fail — fail loudly — and it does. Now, the output isn't great; it's comparing two large HTML files, and you can do a lot of things to make that better. But the conclusion is that the gold master test failed. An obvious question is: if the gold master test failed, other tests should fail too. This is a suite with 107 examples. The app was developed with TDD, so we have Cucumber integration tests and RSpec unit tests, and surely other tests should fail with such a significant change. So let's test that theory by running all the other tests, after resetting the gold master to the way it was on a previous run. We run cucumber and rake, which starts off with RSpec, so that covers all of our tests. And fast forward, because this was on a slower machine. Okay: 107 test examples ran, and only one test failed — the gold master test. As I mentioned, this is an application that, according to code coverage gems, had somewhere between 95 and 100% test coverage — something we tried aggressively to test from the beginning, a suite we put a lot of trust in. And yet only the gold master test failed. All of the other tests said: good job, ship it. So it matters to the gold master test. A more important question, though, is: does this actually matter to a user? Well, here's the Today I Learned site with that change. The post you're seeing at the top was published on July 15th, 2015. Somebody who comes to this site today, with that change in place, would think that nobody has published to our site in over two years — it's just the most-liked post, not the most recent one. I think it was written before we even had syntax highlighting. So somebody who comes here and sees this is going to have a really confusing experience visiting Today I Learned, and only the gold master test caught it. This example is a little contrived, but I hope it gives you a sense of what a gold master test could look like in practice. And this test is not without its challenges. It requires maintenance: any time the schema changes, you're going to want to generate a new gold master, and that's why I'd advocate automating that process as much as you possibly can. It's an investment, like any test, and you need to put in the time to make it work. It can be slower — but as I demonstrated, you can optimize the data in a way that makes sense. And I think some people might say it implies correctness: someone coming along and seeing this test might think that what the code does is right. But that's an opportunity to talk with your team about what a gold master test is, because that's not what it claims — it simply says: this is the behavior. I have some future plans. I'm working on a test for Today I Learned that takes a screenshot of the page on the first test run and another on the second run, using Capybara's save_screenshot method — I'm not the first person to try to write a test like this, but it's been a fun experience — and then compares the two images using ImageMagick's compare function. This covers a gap in my example: what happens if the CSS changes? The test I showed will only catch changes to the HTML, not that. Here's what such an image would look like. This is Today I Learned failing on a subsequent test run, and the stuff that is different is in green. The thing I changed here was removing a Google font that was included and used on every page. Since the browser didn't have the Google font, it fell back to whatever font it thought was best, and that changed the way the header looks — and almost every part of every post. That's a small, tiny change someone could make accidentally, but with this type of test, you would catch it. So, to wrap up: if you have a mature, complex, and stable application, consider gold master testing. It can simulate a much larger test suite if you don't have one to start with, and from my experience, just writing the test is going to tell you some surprising things about your code. And if I could go for a slightly broader conclusion, I would say that the applications of the future are going to require creative testing strategies. Many, many Rails applications are in the situation we've talked about today — becoming legacy applications with no test suite or a partial test suite — and development has to go forward on them. There are also new frameworks and ideas coming along that continue to challenge the boundaries of what a test can be. That's what I came here to RailsConf to talk about, and I would love to talk more with people who share that interest. Before I go, I'd like to say thank you. Thank you to RailsConf, to Hash Rocket, to Brian Dunn, Noel Rappin, Sandi Metz, and Jennifer Hart at the University of Chicago. And thank you for coming.