in production. Awesome. Cool. I'm Aja. I'm the thagomizer on Twitter, thagomizer on GitHub, and I blog at thagomizer.com. I have not posted the slides yet, but I will post them immediately after the talk. I am a bad presenter. And I really, really like dinosaurs, so Pittsburgh has been amazing. I landed at like 2 a.m. after like a two-hour delay in Chicago because it was snowing. Going down the escalator, and there's a dinosaur right there. It was amazing. I love this city. I work on Google Cloud Platform. I'm a developer advocate. If you're interested in Google Cloud, Kubernetes, or other things like that, I'm happy to answer questions and I have plenty of opinions. But you don't have to ask me. We've got seven of us here, and we have a booth down in the vendor hall. You can come say hi. I think we might be out of fidget spinners, but I have a couple stored away in my box over here, so afterwards you can get one from me. And we're here because Google loves Ruby. We love Ruby. We've got a group of Rubyists who work on our Ruby support. And I love my Ruby community.

So, victory conditions for my talk. These are the things that I want you to be feeling or thinking when you leave. First of all, I want folks in this talk to feel comfortable with testing in production. The first time I heard "testing in production" I had a slightly less polite version of "dear heavens, no." But the more I thought about it, the more I realized that this is actually a good thing. And it isn't scary, because in many cases you're already doing it. You just may not be aware of it. So in addition to being comfortable with the idea of testing in production, I want you to walk away from this talk with the ability to be a bit more intentional about your testing.

So I'm going to do quick definitions. First definition: production. Production is any environment that is not pre-production. Second definition: testing. For the purposes of this talk, testing is what we as developers call verifying our expectations. Yes, I did just use "expectation." Yes, I'm a Minitest person. It's okay. If it makes you feel better, you can think "verifying behavior" instead of "verifying expectations."

So we're Rubyists. We test all the time. One of the things I love about this community is that we test. We test a lot. You wouldn't dream of pushing a gem without at least a couple of tests. Maybe not so much documentation, but we're good at testing. And we have all of our great test frameworks where we set up a scenario, do some verification, and then hopefully clean up a little bit, with a call to our method under test somewhere in there. These are what I'm going to call the traditional tests. But there's also a huge category of black box testing, which is where I got my career started: I was a black box web tester doing manual testing. But you can automate black box tests too, and we do a fair amount of that now. All of this is still testing, and I'm bringing it up because we're going to use both techniques for testing in production.

So why should we test in production? Isn't that naughty? The answer is: because a real environment gives you real bugs. You find stuff that you just can't find in your pre-prod environments. For example, production is where you have real user load. While load testing is awesome, and I highly recommend it, most load testing frameworks I've worked with can't actually simulate real user load, because humans are fantastic entropy machines. And I have time to tell a quick story. I was talking to some of my coworkers last week.
I was objecting to a form I had to fill out for another conference I'm doing, and one of my coworkers said, yeah, whenever I see something I'm not quite okay with, I find a way to hack around it. She said, I was signing up for a bike share, and for reasons I don't understand, they wanted to know if I was male or female. So I opened up the form and realized that they were storing male as one and female as two, and I managed to convince the server on the other end that I was a four. Humans are fantastic entropy machines.

Other things you can only do in production: you can test your integrations. Who here uses a billing service of some sort? Audience participation is okay. Okay. Your billing provider probably has a test gateway or a test API endpoint that you can hit when you're testing. They probably also provide some test credit cards that you may or may not be able to use against their production endpoint. But how often do you point your staging environment at the real production gateway and use a real credit card to run a billing transaction through? Whenever I built something that took credit cards, we did that once or twice before the initial rollout, but we didn't do it on a regular basis after that. So if we wanted to test that we were actually integrating with this third party correctly, we had to do it for real, in prod. If you don't have a billing service, maybe you have third-party storage, some cloud storage. Maybe there are other services you're using, like an image processing service or OAuth. Make sure you're testing those, and frequently the only place you can test them for real is prod.

Or maybe you don't use third-party services, but you work on a large team building a huge app, and your team builds one microservice while other microservices are built by other parts of the company. When those come together, that's a seam. How often do you test your seams? How often do you run an integration test across all of this? One of my most frustrating moments at a previous job: we were three days before the big GA, when we were going to go to production with some new stuff. We had two teams, client and server. We hadn't tested that the two pieces could talk to each other. Out of curiosity, I spun it up, and the first thing it did was crash hard, because the person orchestrating the client-side team and the person orchestrating the server-side team had a misunderstanding about the protocol they had developed between the two, and so it just exploded. You have to test your seams. I would hope you test them before production, but sometimes sneaky bugs can get in, and testing your seams in production is also valuable.

Who's heard the term "heisenbug"? For those of you who don't know it, heisenbugs are those bugs that can only be reproduced in production, by that one really important client at that one really important company. Maybe it's an artifact of their network or their browser or some sort of security thing they have, but testing in prod allows you to find these. Same company as the last story: we had a very important client who said, so, it mostly works, but if we do this one thing, something weird happens. It doesn't crash, but it doesn't seem right. We'd spent about a month and a half debugging with them remotely, and finally we packed up laptops and took a site visit about an hour away. We get there and they're like, well, what are you planning on doing?
We're like, we're going to run a network speed test, because it appears that you aren't getting the full download. And they're like, oh, that's not going to work, because we cut off any download greater than a specific number of kilobytes. We're like, uh oh. But we wouldn't have found that unless we had been testing in prod.

So the second thing about testing in prod. I heard this at a meetup about nine months ago, my favorite meetup in Seattle (I live in Seattle), CoffeeOps. Someone said, hey, I want to talk about testing in prod today. Awesome. What's testing in prod? And this person goes on, and I'm thinking I'm going to learn new stuff here. And they're talking about monitoring. They're talking about logging. They're talking about tracing. They're talking about blue-green deployments and canaries, and I'm like, oh, there's nothing new here. This is stuff that's been in common use since the '60s in many cases. Everything I talk about today is techniques that I have seen in use since I got started in tech in 2002. So I guess that means I'm old now, so preemptively I'm telling you all to get off my lawn.

So I've talked a little bit about the background, but I haven't talked about the how. To keep myself on track, because I'm going to talk about a lot of techniques, I'm going to throw a lot of words at you. I'm not going to give you a ton of specifics, but I'm going to give you enough that you know which ones are interesting and know what to search for if you want to find out more. I've divided this talk into four sections: deployment testing, user-focused testing, reusing tests, and my favorite one, implicit testing. Let's dive in.

Deployment testing. The first technique I'm going to talk about is canaries. A canary is just a phased rollout, where you roll out your release gradually to some of your servers at a time over the course of minutes, hours, days, or even weeks. You have a subset of your users or a subset of your servers that's going to receive the new code. Once you've rolled it out, you monitor vigorously for things like errors, memory, and disk, but you also might want to monitor user-based metrics like free trial conversions or purchase path completion. If everything is thumbs up, you expand the canary group, and you keep doing this, release a bit more, monitor, expand, until you've rolled out your new release to all of your servers.

So that's all great, but how do you choose your canary group? You can use internal users; sometimes we call this dogfooding. You can push out to people who don't have a choice but to use your new version and find all the bugs in it. You can just choose randomly: I've got 600 servers, 600 containers, I'm just going to choose some of them. You can do it geographically. This is how a lot of folks do it: okay, we're going to start with a small percentage of the servers in US West, then we're going to do that entire data center, then we're going to go to US East, then Europe, then Asia. You can do it based on demographics. Maybe you only want to roll this out to users who are new, or to users who log in 18 times a day and you're not quite sure why they're using your product so much, but yay. You can also ask users to sign up to get access to stuff early. We're going to get into that a little bit later, but you can use it for canaries too. The cool thing is you can pick as many of these as you like.
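Here's a rough sketch of one way to bucket users into a canary group. The names and the percentage are made up for illustration, and in practice your load balancer or deployment tooling may do this part for you:

```ruby
# Hypothetical canary bucketing. Hash the user id so the same user always
# lands in the same group, and keep the canary slice small to start.
require "zlib"

CANARY_PERCENT = 5

def canary_user?(user_id)
  Zlib.crc32(user_id.to_s) % 100 < CANARY_PERCENT
end

def backend_for(user)
  return :canary if user.internal?         # dogfooders always see the new code
  return :canary if canary_user?(user.id)  # plus a small slice of everyone else
  :stable
end
```

Expanding the canary is then just a matter of raising that percentage while you keep watching the metrics.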
You can use any sort of slice-and-dice combination. The goal is that you start with a small group and roll out gradually, to make sure that whatever you're doing is not toxic and does not take down your environment.

The second deployment strategy is blue-green deployments. Here you have two copies of prod. Two copies: one is blue, one is green. In this case the blue is live and the green is idle; one is always live, one is always idle. When you want to roll out new code, you deploy it to the idle side, in this case the green. Once it's up and running, you have your new code on your idle side and your old code on your live side, and you start routing traffic to the new code. So now live and idle have switched, and you've done your deployment. The nice thing about this is that if something goes wrong, it's an easy rollback, because the previous known-good version was live just a couple of minutes ago, so you can just swap whatever router rule you used to move your traffic and move it back. Depending on how you do your blue-green, it might also be really good for disaster recovery. If your blue and green are in different parts of the same data center and you have a partial data center power outage, which I have been through multiple times, you might be able to move traffic back to your other half of prod, because you've got two copies of everything. It's fantastic.

Having two copies of everything, though, is not always great. Doing databases with blue-green deployment is kind of a pain. So don't use databases, or maybe just leave your databases out of your blue-green clusters. If you want your databases to be part of the system, you can do things with snapshotting and replication, but depending on exactly how your databases are set up, how good you are at setting them up, and how often you're writing, you may have a little bit of a blip as you flip which of your databases is the replica and which one is the primary. Or you can use a non-relational database. A lot of the problems with relational databases and replication are solved if you use a non-relational database. Non-relational databases are awesome.

When I started my career we used a variation of this technique when we did rollouts. We divided our server cluster in half, an A half and a B half. We would deploy to A. We would test it behind a firewall. Once it was good, we would route all the traffic to A, and then we'd deploy to B. This was not true blue-green, though, because we couldn't actually run the site successfully at peak load on just half of the cluster. We had to have at least two-thirds of it up, so we could only use this technique late at night, and we didn't have a hot-swappable backup at all times. But when you're testing in prod, do what works for you.

Both of these techniques, plus many of the others I'm going to talk about, work well in conjunction with auto rollback. In auto rollback you have some predetermined metrics, and if you ever hit those thresholds, the condition is tripped and your deployment system automatically rolls back to a known good release. To do this you have to make sure stuff is scripted, but I'm hoping most of you have scripted your deploys at this point. When I started, I was releasing based on a 34-point printed-out checklist, so I hope you're all doing better than I was.
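As a sketch of what auto rollback might look like: `fetch_metrics`, `rollback!`, and `notify_oncall` here are stand-ins for whatever your monitoring and deploy tooling actually provide, and the thresholds are only examples:

```ruby
# Watch a rollout and automatically roll back if predetermined thresholds trip.
THRESHOLDS = {
  error_rate:     0.05,  # more than 5% errors means the release is bad
  p95_latency_ms: 500,
  disk_percent:   90
}

def healthy?(metrics)
  THRESHOLDS.all? { |name, limit| metrics.fetch(name) <= limit }
end

def watch_rollout(release)
  30.times do
    metrics = fetch_metrics(release)   # query your monitoring system
    unless healthy?(metrics)
      rollback!(release)               # scripted redeploy of the last good version
      notify_oncall("auto-rolled back #{release}: #{metrics.inspect}")
      return
    end
    sleep 60                           # check once a minute during the rollout
  end
end
```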
And if you're going to do auto rollback, you need to be very conscious and careful about your data and database migrations. If you roll back, will you lose important data? Will the old code actually work against the new schema? These are things you need to consider. Is anyone still using session affinity or sticky sessions? Okay, I figured there were still a couple of us out here. If you're using WebSockets it's really hard to get around, actually. With sticky sessions and session affinity, your user has to hit a specific server, because that's where their connection is established. How are you going to deal with that when that server goes away? Things to think about. The biggest thing you can do is separate your data migrations from your code pushes. Push the code, and make sure the code can work with both versions of the schema. Then do the data migration. Once everything is stable, then push code that only works with the new version. It's a really common pattern; lots of us have been doing it for years. Again, get off my lawn. But it's an important thing to know, because it's not the way you're taught when you're learning Rails as a newbie.

Second section: user-focused tests. These are things that test the user experience, and you're like, I'm a developer, that's not testing in prod. It totally counts. You're just testing something other than the underlying stability and correctness of your code. Who's done A/B testing? Hey, people test stuff in prod. It's fantastic. A/B testing is just an experiment. You have a control group and you have some number of experimental groups. You run the users through different experiences, and when you have enough data to be statistically valid (law of large numbers and all), you figure out if there are significant behavioral differences between the groups and decide which one you're going to go with. It's different from blue-green because both are live at the same time. With blue-green, remember, one is always live and one is always idle. But in an A/B test they're both live at the same time, which means you have some interesting things to deal with around data integrity.

Another way of doing user-focused testing is betas and EAPs. For those who haven't heard the term (I hadn't before I started working at Google), an EAP is an early access program. It's like a beta, but usually before a beta, though it's not an alpha. These give you the ability to test the stability, and more specifically the usability, of something you're about to push, because nothing finds edge cases like real users do. But it's important, if you're going to run one of these programs, that you give users enough time. I know folks are like, we had a beta for like eight whole hours. No. Not a beta. You need to give people multiple weeks in many cases, so that they can use your product over time and make sure that it works for all of the scenarios they have, not just kind of glance at it and say, hey, I like the new colors. And you need to make sure that your expectations are clear. If there is an expectation that someone who participates in your EAP is going to give a specific amount of feedback, you need to make sure that's clear up front. You also need to tell them where the known issues are. Every beta has some rough edges, some places where we know stuff's broken. Tell them about that ahead of time, because you don't want 19 bug reports that are all the same.
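A minimal sketch of gating an EAP or beta feature on an opt-in list follows. `BetaEnrollment` is a hypothetical model, and in a real app a feature-flag gem like Flipper would give you this and a lot more:

```ruby
module EarlyAccess
  # True if this user explicitly signed up for this feature's EAP.
  def self.enabled?(feature, user)
    BetaEnrollment.exists?(feature: feature.to_s, user_id: user.id)
  end
end

# Somewhere in a controller:
#   if EarlyAccess.enabled?(:new_checkout, current_user)
#     render "checkout/new_flow"
#   else
#     render "checkout/classic"
#   end
```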
Third section: reusing tests. There was a fantastic talk on Monday about checkups, and this is similar to that content, but I have stories that are different because I've done it as well. The thing I really like about this is that each and every one of you can do it. Running a usability test or a beta is going to require the cooperation of many other people. Changing your deployment process, unless you hold the keys to deployment, is going to require the cooperation of many other people. You can do everything in this section without talking to anyone. It's awesome.

So the big thing is to run smoke tests against production. Here's another story. I was working at a relatively large company that was not my current employer. I had been doing manual testing, but I got permission to start doing some basic automated testing with a really, really clunky record-playback tool. Record-playback tools make really, really brittle tests, but, you know, better than nothing; I'd rather not run that same test manually 15 times a day. And I was sitting there one day and I'm like, hey, I've got these extra servers in my office, because I was running the test lab, and I used to get cold, so they kept me warm. And I'm like, hey, I could run these smoke tests against production, right? I would hope that they never fail. So I set it up: set them to run every four hours on a cron, set it up so it would email me if a test failed, and then, you know, let it go. It worked for a couple of days and I was really excited, and then I promptly mostly forgot that this was happening. I come back from lunch one day about two months later and I have an email saying it failed. I'm like, there's no way it failed. If this was actually down for 30 minutes, someone would have noticed. So I go run the test manually, and actually it had failed. One of our suppliers was not sending us all the information we needed when we made a request. Normally monitoring would catch this, but it was something along the lines of: they were sending back a response, just with an empty response body. So we were getting 200s, not failures, which meant it didn't get caught by normal monitoring. But it did get caught by this test. So I'm like, hey, this is broken. And we managed to contact the third party we were using, have them fix their thing, make sure our stuff was still working, and do it all within a couple of hours, before anyone noticed. We wouldn't have caught this bug before a user noticed unless I'd been running these smoke tests, the ones I normally used for releases and day-to-day testing, against production. And no one knew that I had set that up. I just, you know, had a server available.

I've been using the term smoke test; for folks who don't know, a smoke test is a super simple test of the core functionality of your product. It comes from the idea of "where there's smoke, there's fire": if this fails, something is on fire. And I personally believe that even really big, complicated products will have relatively few smoke tests. Everywhere I've worked, we've kept it under six; I would imagine almost everyone can keep it under a dozen, because you're just testing the very basics. So if you're gonna do this: pick a subset of your existing tests. You probably have something you would consider a smoke test in your integration suite already, so just reuse it. Set it on a schedule: every n hours, once a day, once a week, whatever makes sense. Focus on things like your third-party integrations and the absolute core functionality of your product. And I'm gonna point out here that if you use something like the VCR gem when you normally run your tests, so that the tests run faster and you don't make requests against a third party, consider not doing that when you're running these tests against production, because you're not actually testing your integrations if you're faking out the integration part of it.
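To make that concrete, here's roughly what one of those production smoke tests could look like. The host, the paths, and the assertions are invented, but the shape is just ordinary Minitest pointed at the real site, with VCR out of the picture:

```ruby
require "minitest/autorun"
require "net/http"
require "json"

class ProductionSmokeTest < Minitest::Test
  BASE = URI("https://www.example.com")   # your real production host

  def test_homepage_is_up
    response = Net::HTTP.get_response(BASE)
    assert_equal "200", response.code
    refute_empty response.body            # a 200 with an empty body is still broken
  end

  def test_supplier_feed_returns_products
    response = Net::HTTP.get_response(URI.join(BASE.to_s, "/api/catalog"))
    assert_equal "200", response.code
    refute_empty JSON.parse(response.body), "supplier integration returned nothing"
  end
end
```

Something like a cron entry every few hours, plus whatever you use to get notified on failure, is all the scheduling this needs.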
And the big thing is: leave no trace. Your tests absolutely must leave no trace. Ideally you want them to clean up after themselves. Because of this, most of my smoke tests don't do purchases. Everywhere I've worked where I've been involved in designing the database schema, our purchase database has not allowed updates or deletes, only new writes. So if I did a purchase, I couldn't delete it. If you have a system like that, make sure that you have a way of not doing purchases, or, if you do purchases, that you can flag them, because you don't want "I've been running this test every minute and all of a sudden we're making tons of money" to show up in your reports.

On to my next section, which I'm going to call controlled breakage. In controlled breakage you purposefully and deliberately break various parts of your system. Take servers down. Pretend that the disk went bad. Pretend that your network pipe got really, really small. And what are you testing in this case? You're testing your ability to respond and recover. Is your system supposed to be self-healing? Does it? Or is the person carrying the pager supposed to detect these types of errors and address them? Do they? I really like this kind of testing. I did start my career in test; I fundamentally love breaking things. It is fantastic and wonderful and it is one of my favorite things. So the first time I got permission to do this, I went nuts. I found all sorts of stuff that was completely and utterly busted. I was writing up all these bugs, and they were all coming back as won't-fix. Won't fix. Won't fix. Because, just like security, durability is something where you can never be absolutely durable. You can be more durable or less durable, but it's always a trade-off between durability and the amount of engineering time you want to dedicate to it, which is a proxy for cost. And it doesn't make sense to be durable against four lightning strikes in a row that hit your server directly, because it's not a realistic scenario for most people most of the time. So what you're going to do is stay in scope: stay in the scope of stuff that makes sense, stay in the scope of stuff that you and your team have agreed you should be able to respond to.

I can't talk about controlled breakage without mentioning Netflix's Simian Army and Chaos Monkey. They're open source; go check them out, they're cool. And we actually do this kind of controlled breakage testing at Google. We call it DiRT, disaster recovery testing. I have not participated as an engineer in that process, but I found a fantastic talk that, if everything worked correctly, should be tweeted under my Twitter handle already, and you should go watch it. It's by one of the SREs who started the DiRT process, and it has some fantastic stories about things that they accidentally and on purpose did to test the disaster recovery and durability of Google. Related is penetration testing. Who's been able to do some pen testing? I got to do some about six months ago, and it was awesome. I got a week of time and I was hacking against stuff. It was fantastic. It was really, really fun to put yourself in the mind of an evil adversary: you've got your curly mustache, horrible hat thing going on, doing some rock-and-roll stuff there. And it's totally another form of controlled breakage. You try to figure out the kinds of mistakes you've likely made, and figure out whether you've patched against them.
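The mechanics of controlled breakage can be as simple as this kind of sketch. `terminate` stands in for whatever your cloud or orchestration tooling exposes, and the scope list is the part you agree on with your team ahead of time:

```ruby
# Chaos-monkey-flavored controlled breakage: only ever touch instances the
# team has agreed are in scope, and only during an announced window.
GAME_DAY_SCOPE = %w[web-7 web-8 worker-3]

def break_something!
  victim = GAME_DAY_SCOPE.sample
  puts "#{Time.now}: terminating #{victim}, watch the dashboards"
  terminate(victim)  # e.g. an API call to your cloud provider
end
```

Then you watch: did traffic drain cleanly, did the replacement come up, did the right person get paged?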
But since we're talking about this, I'm going to talk about the fact that controlled breakage needs to be ethical breakage. DHH touched on this in his keynote: we have the power for both good and evil. Make sure you're using your power for good. Think carefully about the potential impacts of your choices on your users, on your company, and on your job. You want to make sure that the choices you are making are reasonable and ethical. And every time I've worked on penetration testing or talked to folks about it, there have always been rules of play. Frequently, for big exercises, there's also a proctor who can make sure that you are playing fair and being ethical in what you're doing.

My last form of testing in production is disaster recovery verification. Who has a DR plan? Who's tested it in the last year? Congratulations, you have successfully tested in production, and you're doing better than the vast majority of the audience. Disaster recovery is when you make a plan for your data center catching on fire in a way that you can't predict. I did a talk at RubyConf in Cincinnati about the time they were replacing pieces in the power conditioners at the data center and it caught on fire. We were down for an hour and then we ran on diesel for 11 days. It was an error that the supplier of the power system had never, ever seen before. It was not supposed to be able to happen. So: therefore, disaster. Disaster recovery is how you're planning on dealing with things like that. And this, for real, needs to happen in production. If you haven't tested this plan in production, you haven't tested it, because by its very nature your disaster recovery plan is for when something bad happens in production. Maybe you need to move traffic to another cluster. Maybe you need to move data between data centers. Maybe you need to restore databases from a backup. I accidentally deleted, well, I accidentally corrupted, a production database at 11 p.m. once, because I ran the feature branch migrations instead of the trunk migrations against it. Yeah, that was fun. Luckily I had taken a database backup right before I did that, so I was able to restore from the backup. And I knew how to restore from the backup because I had actually been practicing that on a regular basis, so I was able to do it without thinking, because I was freaked out. I was like, oh god, they're going to fire me, they're going to fire me, they're going to fire me. But it all ended up being okay. As part of DR, you want to make sure you're testing scripts, all the scripts that do network migrations, database restores, all that. But you're also testing your people. And everyone's like, testing people isn't testing. There's my Minitest test for testing people right there.

So, implicit testing. This last section I'm calling implicit testing; I originally was going to call it passive testing, but I didn't like the way that sounded. This is the testing that you're all already doing, but you don't actually think of it as testing. That's a Stackdriver monitoring graph of memory usage on an internal app that I work on at Google. That's a spike. And if this was actually a mission-critical app, I would have hoped to be alerted to that spike. Luckily, if it goes down for a couple of days, we don't actually care that much. But I'm talking about monitoring. What does monitoring have to do with testing? So who's got monitoring? Raise your hand. Please, most of you. Thank you. Who has alerts on their monitoring? It turns out alerts are tests. Think about that for a minute. We think of alerts as the thing that tells us that something is wrong. But if we massage the English a little bit, they tell us that the system isn't meeting expectations. And back at the beginning of this talk, I defined testing as verifying that your expectations are met. So by definition, alerts are testing. Still don't believe me? Say I have an alert that fires if latency is greater than 500 milliseconds. There's my test.
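To put that in code terms, the same expectation reads almost identically as a test assertion and as a scheduled alert check. `current_p95_latency_ms` and `page_oncall` are placeholders for whatever your monitoring stack actually gives you:

```ruby
# In a test suite you'd write something like:
#   assert_operator current_p95_latency_ms, :<, 500

# As an "alert", it's the same expectation checked on a schedule:
LATENCY_LIMIT_MS = 500

def check_latency
  latency = current_p95_latency_ms
  return if latency < LATENCY_LIMIT_MS
  page_oncall("p95 latency is #{latency}ms, limit is #{LATENCY_LIMIT_MS}ms")
end
```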
And when you're setting up your monitoring: too many folks I know just look at the system for a couple of days and say, yeah, this is what it's supposed to look like, and set up their alerts based on that. I encourage you to take a step back and think about how you want your system to be working. Think about the kinds of behavior that you need. Maybe you have an endpoint that 90% of your traffic goes through. That one should probably respond pretty fast, huh? Maybe you want your error rate to be less than 5%. Set that test up. Or maybe you think that your disk should never be more than 80% full. Set that test up. We just call these tests alerts. And a variation on this is looking at month-over-month or year-over-year trends, so that you can actually answer questions and make assertions like "our error rate should not get larger" and "our site should not get slower." Here's a screenshot from Stackdriver Trace of the same app; I believe this is actually a year-over-year comparison. You can see that it's bimodal, and depending on which one the blue is, new or old, it maybe got a little bit slower on the far end, where responses are slower. But it mostly looks the same. So I feel pretty okay that my assertion that behavior has not actually changed, my expectation that behavior hasn't changed, is actually a valid expectation. Again, because I was having fun with this: you want to assert that your old one and your new one are the same, or hopefully that your new error rate is less.

So I've thrown a whole bunch of thoughts at you, ideas, words, and now I'm gonna give you some basic dos and don'ts. At the end there's a cheat sheet, so you don't have to take pictures of every slide, and I will publish the slides. Do have clear goals. You should go into this intentionally. Figure out what your goals are, figure out what your expectations are, and start from there when you're picking what you want to monitor and test in production. Don't DDoS yourself. So, I was doing a disaster recovery test. We took down a server that was holding a bunch of WebSockets. We're like, okay, the clients are supposed to reconnect. So they reconnect to the fallback server. And the fallback server promptly falls over, because it wasn't capable of handling that many simultaneous reconnections. So it falls over, and the clients start trying to reconnect as we bring it back up, and it falls over again. We got into a cycle of fail. So we learned things, but in the process of our disaster recovery testing we accidentally DDoSed ourselves with our own app. So don't do that; it is bad. Think carefully about the possible impacts of the tests you're about to do before you do them. We talked about it before, but test your seams: test where your stuff integrates with the stuff built by the people who sit down the hall, or the people on the other end of Slack if you work remote. Don't mess with user data. No, no, no, no, no.
We do not mess with user data. We do not view user data unless we have a really good reason. If your company does not have a user data access policy, you should go make that happen; it's the right thing to do. Keep your tests as walled off as possible, and make sure your test data isn't considered user data either, because you don't want that data corrupting any of your other reports. And do clean up after yourself. Do the Girl Scout thing: leave no trace, be a good citizen. This is one of my soapboxes: alerts should be actionable. If you're using alerts as a form of test, awesome, but make sure that a test that isn't urgent does not page someone at 3 a.m. If they get a page and there is nothing they can actually do other than go back to bed and deal with it in the morning, they shouldn't have been paged in the first place. It's the way we get ops burnout. Just don't do it. Verify your integrations. After that experience at my first job, where I found a bug we wouldn't have found just by running some of our tests against production, I now trust but verify all of my third-party integrations on a regular basis, because they can fail. And actually more common than the third party failing is: they updated their API, you didn't get the email, and you were using VCR so you were still getting the old responses, and then boom. Make sure you're testing. And the big one: whatever you choose to do, act methodically. Make sure you are doing stuff with a purpose and a plan, so that if something goes completely wrong you know what you've done and you know how to undo it. Here's your cheat sheet: have clear goals, test your seams, verify your integrations, clean up after yourself, don't DDoS yourself, leave user data alone, and keep alerts actionable. I'll say thank you, and get off my lawn.

The question is how we handle auth against production servers. The way I've always done it is I've created a magical test account, and the nice thing about that is that everything associated with the magical test account I know to ignore. I worked at a place where we used a specific last name for magical test accounts. It started with five Xs, so we hopefully wouldn't pick up anyone's real name in the SQL queries, and that's because part of our smoke test was testing sign-up, so we had to create new accounts. There are other ways to do it. There are tools you can use, and there are companies that actually offer production testing services. Do the right thing for you. You're already testing in production, so you might as well do it on purpose. Okay, thank you all.