minor disasters. I'm Aja. I like it when people tweet at me during my talks; I'm thagomizer on the Twitters. So I submitted this talk to a track entitled War Stories, and the track didn't end up happening. But I know we've all had that day where we broke the Internet or everything went wrong. Lots of folks seem to have a big red button story. I was at a DevOps meetup in Seattle and I'm like, hey, tell me all your war stories, and it was amazing how many stories started, "So this one time, the CEO came to the data center and asked, what is that big red button for?" And then the story progresses and eventually the entire data center goes up in a pile of flames. Some of my favorite war stories start, "So do you know how the fire suppression system at a data center works? Because I do." Today I want to share with you some of the boneheaded things I've done and some of the really odd situations that I've been in, and how they've changed how I write software and how I build the teams that I work on. So let's all gather around the virtual campfire, because it's story time.

So once upon a time at midnight, I was doing a release for a startup I worked at. We did our releases at midnight because we couldn't do migrations and keep the site online, so we'd take the site down, do the release, and then bring it back up. This happened to be the first time that I was doing the release completely solo, because my secondary on call was on the Trans-Siberian Railway somewhere in Russia or Mongolia, and I'm not even kidding about that. So he was completely unavailable to me, and I was pretty junior at this point; I think I'd only been in the industry for about three or four years. But luckily, my backup, the person who led our team, was former military and he believed in being highly organized. So he had left me with a 30-plus item checklist that was basically the pre-flight checklist for a release of our product. Every single step had to be done in order. You had to print out the checklist and check the steps off as you did them, but every single step had to be done manually. And it turns out that that takes a long time, because first you tell the team that you're going to start the release, then you put the maintenance page up, then you notify the team that the site is now down, then you put the new code on the server, then you run the database migrations, then you restart the servers, then you start manually testing, then you tell the team that you're going to bring the site up, then you bring the site up, then you manually test again, then you tell the team that the release has been completed, and then you watch it for 15 minutes and make sure you didn't blow anything up.

So I get through that process and I get to the step where we do the initial round of manual testing. And I go onto the page and I see this. Something was wrong, and it wasn't that all pages were throwing the standard Rails 500 page. Some of them were just showing assets that were missing. Some of them were rendering in really odd ways. And it took me a couple of minutes, and I looked at the logs, and I realized, oh no, I pushed master. I didn't push the release branch. Which, you know, would have been fine, because I could have just pushed the release branch on top of it; it would have taken five extra minutes, brought the site back up, everything would have been fine, except I'd run the database migrations. So I had effectively corrupted our production database at 1 a.m.
on my very first solo release, while my backup was in Mongolia. So at this point I start freaking out. I start IMing some of my friends from the Seattle Ruby community: oh my god, oh my god, oh my god, what did I do wrong? And they're like, you know this. You have these skills. You've done this before. You know how to fix this. I took a deep breath and remembered, oh right, one of those 30-plus pre-flight checklist steps was to take a database backup. So I have a database backup. And luckily, because I was working as a QA engineer at the time, one of the things I did on a regular basis as part of my day-to-day job was restoring a backup to our staging server. So I knew exactly how to restore the backup; I could type those commands in my sleep. So I'm like, well, if it works on staging, hopefully it'll work on production. I started the restore, and that took about 45 minutes. So I had a good period of time to get my blood pressure down, get my pulse rate down, and pace the circle that was my apartment at the time, through the living room, through the kitchen, through the living room, through the kitchen, while I hoped that everything would work out. And I pushed the correct branch, ran the correct set of migrations, brought everything back up, and it was fine. And I emailed the team saying the release was successful. And I get an email back from my boss: we were down a lot longer than we should have been, what happened? And I'm like, so here's what happened, here's what I did. He goes, we're going to talk about that in the morning. But I was actually feeling pretty good. Because as part of my pacing, I had remembered that while this was the only time I had made that particular error, I had worked at several other companies where someone else had made it, including my very first release at a relatively large software company, in my very first job out of college, where we were down two and a half hours longer than we intended to be, or some period of time, because someone else had made almost exactly the same error on that release night.

So, my lessons learned. Automate everything. People make stupid mistakes at midnight. I make stupid mistakes at midnight. And I wouldn't have made that error if that lovely checklist had just been automated as a bash script, and it would have been easy to do that. The other thing is, if you're going to automate your releases, you should automate your rollback as well. And that probably means you need to write your rollback migrations, your down migrations in Rails. Because we didn't do that. Why would we ever roll back? The other thing is, always have a backup. If I hadn't had a backup that night, I would have had to write code on the fly to undo the migrations we had just done, and I technically wasn't even a dev at that point. So having a backup is totally useful.
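And since I mentioned down migrations: every Rails migration can describe how to undo itself, and writing that half is what makes an automated rollback possible. Here's a minimal sketch of what I mean; the table and column names are made up for illustration, not our actual schema:

    # A minimal sketch of a reversible Rails migration. The table and
    # column here are hypothetical, just to show the shape.
    class AddPlanToSubscriptions < ActiveRecord::Migration[5.0]
      def up
        add_column :subscriptions, :plan, :string, default: "free"
      end

      # The half we never wrote: how to undo the change, so an
      # automated release can back itself out.
      def down
        remove_column :subscriptions, :plan
      end
    end

With that down method in place, backing out is just rake db:rollback, which beats a 45-minute restore at 1 a.m.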
So, I claim that this talk is about data center fires. Here's a data center fire. Once upon a time, someone sent a message to our pager: customers were having a hard time getting through our checkout process. We tried it out, and we got the error that said, we're sorry, we have been unable to charge your credit card, please contact your credit card provider. And we happened to know enough to know that that was the error we showed no matter what error we got back from the credit card processor. So we dig into the logs, and we're seeing that our site is getting a timeout trying to process cards. So our assumption is that we broke something. Someone goes, well, why don't we try processing cards manually? Our provider had a website where we could deal with cards if someone did a phone-in order. So we did that, and that's the page that showed up: a timeout error. We're like, well, that's not good. So someone's like, well, let's call the provider. And we were put on hold, and a couple of minutes later the call disconnects. And we're like, well, that's really not good. And it happened, through a series of odd events, that the provider we were using was across the street from the company I was working at. They were on approximately the same floor of two high rises in Bellevue, Washington, and if we squinted and tilted our heads just ever so slightly, we could see into their network operations center. So I, you know, smashed my face against the window and looked, and we see a lot of red. And we're like, that's really not good. So someone had the brilliant idea to check the news. And we find out that a relatively large data center and news facility in Seattle, Washington had had a fire, and 19 fire vehicles had responded. When the fire started, the power went off, but the generators kicked in, as they're supposed to do. And then the fire department shows up and is like, it's an electrical fire, you don't get any generators, sorry, and turned the generators back off. And the entire facility had gone offline, taking with it two radio stations, a television station, and four or five colos in that building, including the one that our credit card processor was in. So here's an actual picture of the damage from that fire that I was able to find via some news sources in Seattle.

And luckily for us, once we figured out what the issue was, it was pretty easy to fix, because due to some experiences I had had at a previous job, I had insisted that we have a way to turn off the store, so that we could turn off all credit card processing, all renewing of subscriptions, all free trials with credit cards, everything, and the site would keep running and everyone would keep having a good experience. We basically put all accounts into free play mode while the store was off, which was great, because this particular fire happened over a holiday weekend; it happened on July 3rd, I believe. And while the fire department got the fire out and they started bringing back some of the facilities, the fire inspector had to check every single colo, every single duct, every single connection before a given section of the facility could be brought back online. And then our credit card processor had to bring all of their infrastructure back up, because they were only in that one data center. So we ended up not having credit card processing for four days. And if we hadn't built the system in such a way that renewals were going to be fine, free trials were going to be fine, everything was going to be fine without the store running, we would have been in a really big world of hurt. So this cemented my strong belief that you should make sure that you can gracefully fall back if any of your external dependencies fail. And you should preferably have a way to activate that fallback without having to redeploy your system completely. We happened to have a console page: you logged in with an admin account, you went to a specific URL, there was a checkbox, turn off store, hit submit, and everything was fine, everything just picked it up.
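To give you a picture of the shape of that switch, here's a rough sketch in Ruby. The names here, StoreSwitch, enable_free_play!, and the gateway call, are stand-ins for illustration, not the code we actually ran:

    # Rough sketch of a store kill switch. StoreSwitch, enable_free_play!,
    # and gateway.charge are hypothetical names, not our real code.
    class StoreSwitch
      @enabled = true
      class << self
        attr_accessor :enabled   # flipped from an admin-only console page
      end
    end

    class Checkout
      def initialize(gateway)
        @gateway = gateway
      end

      def purchase(account, card)
        unless StoreSwitch.enabled
          # Store is off (processor down, or an admin flipped the box):
          # don't touch the card, keep the account running in free play.
          account.enable_free_play!
          return :store_offline
        end
        @gateway.charge(card)
      end
    end

The important part of the design is that flipping the switch doesn't require a deploy; any admin can do it from the console page while the rest of the site keeps running.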
That switch was good, because it meant that solving the problem was fast, but it also meant that a couple weeks later someone accidentally clicked that box because they thought they were on staging and were testing something, and we ran without credit cards for a couple of days that time too. After that particular incident, we added some really obnoxious colors to the admin console on production so that you could not miss the fact that you were on production.

So, the title of this talk is data center fires, plural. There's another fire story I have from working at that same company. When we had the first fire with our credit card processor, we took it as a chance to check and see if our systems were hardened appropriately, whether we were protected correctly against eventualities like this, and we decided we should upgrade. We were in a great facility, but we wanted one that had a little bit better of a track record. So we moved to this really nice mom-and-pop colo. Very friendly people. One of the best Christmas trees I've ever seen; it had RAM sticks and peppermint sticks on it. It was great. And they sent us an email saying that about a month in the future they were going to have to do some mandatory routine maintenance on the APUs and the power conditioners, basically all the equipment that takes line power and puts it into some form that will actually work for all the electronics running in the colo. And we're like, great, we trust you guys, you're awesome. And the appointed time came, and they sent an email announcing that they were on generator power. We had not noticed any significant blip, and we had made sure that we had battery backups in our rack in case there was a momentary issue. And things ran great for about two hours. And then suddenly we went down hard. Hard. And the colo sent us a mail saying that there had been an incident during maintenance, that all power to our section of the facility had been cut off, that all personnel had been evacuated, but that they would try to get our rack and the rest of the racks back online within the next hour. And if you know anything about how colocation facilities or data centers work, if the words incident and all personnel have been evacuated are in the same sentence, it means that something caught fire and the halon was activated. Because when you use a halon fire suppression system, everyone has to leave for like 15 minutes while the gas dissipates. And we had done enough maintenance there, and when we moved our racks in we had been told, if you hear this sound, you will leave, and we're going to have someone standing next to you the entire time you're in the facility who will drag you out if you choose not to leave, so we knew what was going on. So I looked at the other person I worked with on the infrastructure side, and we're like, well, we've got to go fix something. He grabbed the go bag and headed down to the data center. He was going to bring things back up as quickly as possible once we were let back in. I'm like, hey, I can show up, I can help. He goes, no, I actually need you back here at headquarters to deal with everyone else in the company and also to check things from this side. And I'm like, well, I can just go down and you can do that. Like, I can just go turn switches on; it's not hard. He goes, no, no, actually, stuff has to be brought back up in a specific order. You can't just turn all the switches back on and have everything magically start working again. And I'm like, is that order written down anywhere?
Is this process documented? And he's like, no, no. And so we realized that we had a knowledge silo, a pretty significant one, because if he'd been on the Trans-Siberian Railway when this particular incident had happened, stuff wouldn't have actually gone very well. So we got everything back up; we were only down for about an hour and a half. But the colo wasn't nearly as fortunate. They sent us a couple of pictures about three hours later of the parts that had caught fire. It was actually a picture of a fried circuit board, and it was kind of impressive. I don't have that picture, otherwise I'd share it with you. But the comment attached to it was, our vendor says, we've never seen anything like this before. And they ended up having to run on diesel for 11 days before they were able to get the parts that had fried into stock and come up with a process, a procedure, for replacing that part, because this was not a part you normally had to replace; it didn't normally fail this way. And we used this as a wake-up call. We re-architected our rack, because we had all the databases and all the switches running on one battery and all the servers running on another, and we realized that it made more sense to split the functionality across the batteries so we could bring up half the rack and have the site up, and then deal with the other half of the rack, as opposed to having to bring everything up to have a valid site.

And we also decided that we shouldn't have any silos. We talk about pair programming and making sure everyone knows stuff. We talk about doing code reviews. But what we also need to do is focus on things like infrastructure pairing and deployment pairing. Everyone needs to know how to do everything. And maybe not everyone needs to know how to do everything, but you need to have at least n plus one; you need to get your bus number greater than one. So we started pairing on the hardware. We started pairing on the infrastructure. I got to come down and do that rewiring of the cabinet where we moved things around, so that I knew how stuff worked. And I got the second set of keys to our rack, so that there were two of us who could get in in case something went wrong. The other important lesson I learned was that you need to have a disaster recovery plan in place, and you need to practice it. We were down longer than we should have been because we hadn't practiced bringing the site back up from completely down. We had done it before, when we moved colos, but the site had gotten more complicated and the hardware had gotten more complicated since then. And finally, I use the cloud now. I want things catching on fire to be someone else's problem. And I like to work with cloud providers who will proactively move your workload out of a section of the data center that's going to have maintenance, and I happen to work for one of those. The way we handled the most recent hurricane scare on the east coast was really awesome, in that we took care to make sure that no one was going to have problems. Also, multiple regions: have your stuff in more than one place. Redundancy is great.

So this next story isn't one of mine, but it starts with the phrase, once upon a time in Japan. I work on a team with a lot of developer advocates at Google, and we have this interesting demo that we like to take to events called Cloudspin. And this is what it looks like.
It uses a bunch of phones and a big pile of metal tubing and some stuff to take pictures and then stitch them together, Matrix or Crouching Tiger style. And it's really popular, so we take it to all sorts of events, and we were going to take it to an event in Japan. But if you've ever tried to take significant numbers of electronics across an international border, it raises eyebrows, and many times it's just not worth it. So what we choose to do instead is, as much as possible, buy the gear locally. So we had someone in the Japanese office, and we're like, go buy 30 of this particular phone. Here's the model number, here's the name. We'll bring the parts that are custom, the big rack, but you need to go buy the phones for us. And that was great, until we got there the night before, the team is setting this up, they go to plug stuff in, and they realize that the phone with this model number and this model name in Japan has an entirely different set of connectors than it has in the US. And the stuff they brought doesn't work with the phones that they now have. So, we're in Tokyo. You can buy electronics in Tokyo. This is not actually a surprise. So they head out to the electronics stores, like, we'll just find the adapter between the US version and the Japanese version, we'll buy 30 of them, it'll be fine. But three hours later they haven't found any. So, team meeting: should we just cancel the demo? Someone's like, nope, I know how to solder. So the team ended up staying up late that night soldering, trying to get the connectors to all connect, and they pulled it off. They got it done and the demo went over really well. And I learned a lot of lessons from the next team meeting, where they told the story. One of which is: if details matter, don't make any assumptions. They made an assumption that the same phone with the same name would have the same type of connector in different countries. Seems like a safe assumption, but it wasn't, and that detail was important. So they should have had someone take a picture and send it to them before they got all the gear over to Japan. Also, have people on your team with diverse interests and hobbies, so that you have someone who can save the day by soldering. My team is really cool: lots of very odd and interesting people, including several who are very into quadcopters and electronics and electronic music. And so the question wasn't who can solder; it was who's the best at soldering. And they set up an assembly line with two people with soldering irons, which they could totally buy, and two or three more people setting things up, with the wires all laid out exactly how they needed to be for the people who were soldering.

So my last story is that sometimes you're your own worst enemy. Once upon a time I was working on a client-server web application. It was specifically using WebSockets, and I'm going to fudge some details; like many of my stories, most of the story is true, but some details have been obscured to protect the innocent. But it's important for this one that you know that this particular WebSocket application needed between 30 and 60 frames a second, or messages a second; otherwise it would try to reconnect, or the experience would significantly degrade. And that was because we were doing animation with WebSockets, and there's a whole pile of lessons learned here that basically start with the phrase "don't", but we were, and it was mostly working. We were in beta, we had some customers, and it was great.
But one day we're having our weekly retrospective, talking about what went well and what went poorly the week before. And we get the spidey sense going; we check the site, and stuff's not looking right. So two of us take our beers, go back to our desks, and start trying to dig into it. And we see that traffic and memory usage were spinning wildly out of control, and the servers were shutting down and then restarting, repeatedly. Well, that's not good. So we spent about 30 minutes debugging, and eventually we just hit the big red button: shut everything down, bring everything back up. And things went back to normal, and we spent the rest of that week adding some additional logging so that if it ever happened again we'd be able to figure it out. But it was a beta product, we didn't actually own all the client software, stuff like this happened occasionally, and it wasn't a big deal. And so I went on vacation, and while I was on vacation the same thing happened, except this time, after restarting everything, it fell over again and again and again. The system ended up coming back up safely, but only toward the end of the day, when traffic would have lowered anyway. And I came back to a team that had spent about three days debugging, and I brought in a different set of experiences. They had been trying to load the logs, several gigabytes of logs, into a Mongo database; I started text processing them instead, making a timeline of what had happened and drawing a visualization of, we got this request, and then we got these requests, and then we got these requests. And between me having fresh eyes and not having dealt with the emergency, and them having had time to eliminate all the other things that could have gone wrong, we realized what had happened: a malformed socket message had been saved to the database, and it caused the server to go into a bad state. As a result of the server being in a bad state, the clients didn't get the frame rate, the number of messages, they expected, and they got disconnected. So they tried to reconnect, but they couldn't reconnect to the server they had just been talking to, because it was in a bad state. So they kept trying to reconnect, harder and harder and harder, and eventually they did get reconnected to a different server. And that server would be like, great, you're reconnected, here's all the messages you missed while you were offline, and it would resend the bad message, which would take that server down, which would then take down all the client machines connected to that server, and then they would all try to reconnect and reconnect and reconnect harder. The short version is, we DDoS'd ourselves by trying to keep the connection to the server alive.
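The fix for "reconnect harder" is to back off between attempts and add some jitter so a whole fleet of clients doesn't come back in lockstep. Here's a minimal sketch of the idea in Ruby; the method name and the limits are made up for illustration, not our actual client code:

    require "socket"

    # Minimal sketch of reconnecting with exponential backoff and jitter.
    # connect_with_backoff and the limits below are hypothetical.
    MAX_ATTEMPTS = 8
    BASE_DELAY   = 0.5   # seconds
    MAX_DELAY    = 30.0  # seconds

    def connect_with_backoff(host, port)
      attempts = 0
      begin
        TCPSocket.new(host, port)
      rescue SystemCallError
        attempts += 1
        raise if attempts >= MAX_ATTEMPTS
        # Wait 0.5s, 1s, 2s, 4s... capped at MAX_DELAY, plus random jitter,
        # instead of hammering a server that's already in a bad state.
        delay = [BASE_DELAY * (2 ** (attempts - 1)), MAX_DELAY].min
        sleep(delay + rand * delay)
        retry
      end
    end

The jitter matters as much as the backoff: without it, every client that got dropped comes back at exactly the same moment and knocks the next server over too.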
And my comments originally said, I'm sure others have done similar things. But I know that others have done similar things, because who actually read the post-mortem, the public post-mortem, from the DNS outage a couple of weeks ago? Yeah, I was expecting to see a couple of hands. If you read that, you noticed that one of the things that made it worse than it already was is the way DNS works: if you can't reach the DNS server you're trying to connect to, you go ask your friends, hey, can you reach this guy? Which exponentially increases traffic. So in addition to all the malicious traffic, there was an exponential increase in valid, legitimate traffic that they were also dealing with. So the very way the DNS protocol is written actually made the situation worse, and effectively made the DDoS that was being perpetrated against them even worse. And the moral of that story is that you're often your own worst enemy. So when you're designing your system, think about all the ways that you could break things, think about all the ways your own code can take down your system, and then harden against them. Also, incremental backoff, like I just showed, is a fantastic thing.

So I've got a couple of minutes left, and what I really wanted to emphasize by doing this talk is that we all mess up. I'm sure if I asked people in this room to raise their hand if they've ever broken their site, or taken down the internet, or let the blue smoke out of their computer, a lot of hands would go up. I know that my claim to fame is that I once dropped a Mac in a bathtub. It was fine, by the way. But what saves us? Lots of things save us when things go wrong. One of them is trust. I couldn't have gotten through these situations if I didn't trust my coworkers and trust my tools. And if you can't do those things, you either need to fix yourself, maybe by learning your tools, or you need to fix your situation so that you have different coworkers. Also, what saves us is learning from our bad experiences. The reason I insisted we be able to turn the store off was that I had seen an external dependency go sideways and cause an issue at a previous company I worked at. The reason that we had a really great checklist for the release, one that included taking that backup, was that my coworker, from his military training, knew that you needed to think through all of the possibilities and write everything out. We learn from our own experiences and we learn from other people's experiences. So I hope that all of y'all will take something out of this talk, whether that's incremental backoff, or having a backup, or automating everything you can. You also need to be able to communicate with the people you work with, and to communicate clearly and honestly. You need to be able to say, I messed up, something went wrong here, this is what I'm seeing, and know that they're not going to freak out and they're not going to blame you. And you need to have group ownership. We don't want silos. So when you have a new person on the team: drag them along. Bring them along on the field trip to the data center. Have them sit over your shoulder when you do the release, and then the next release you sit over their shoulder, so that everyone knows how to do these things, so everyone can help out, so that you have a high bus number.

And everything I'm talking about is stuff that comes up in postmortems. Show of hands: who's been involved in a postmortem at their job? I've been involved in a couple. And I work at Google now, and part of my job working in DevOps advocacy at Google is that I get to hang out with the SRE team, and they're fun. They have some great stories that I can't share with you, which is sad for me. But one of the things that a lot of the SREs, especially the really senior ones that I've talked to, believe strongly in is the idea of blameless postmortems. And here's a quote from the SRE book that came out six months or a year ago: "A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had." And that's the trust piece.
"If a culture of finger pointing and shaming individuals or teams for doing the wrong thing prevails, people will not bring issues to light for fear of punishment." And that's the communication part. And I've had the joy of seeing this in action. There have been a couple of incidents that I wasn't involved in but had visibility into, where someone clearly did something boneheaded. But they didn't do it because they were stupid. They didn't do it because they were dumb. They did it because the system was set up in such a way that it let them do that. And so when the postmortem happens, after the incident is done and everyone's had a chance to step back and feel better about things, the discussion isn't, why are you such an idiot? The discussion is, how can we make it so that no one else does that same boneheaded thing, because we can totally understand why you did it; we can totally understand that it made perfect sense in the situation you were in. And that's great. Having that culture on the couple of teams where I've had it, where we just assume that everyone had the best intentions and we're not going to look for reasons to fire people, means that we learn, we get better, our systems get better, and everyone benefits, both at the company we're at and at the places we go in the future.

This wasn't actually where I intended this talk to go, but it is important. As I was reading through these stories, I started with the list and asked, how do I tie them together? And one of the things that came up for me is that in all of these stories, the diversity of experience and the diversity of skills of the folks I worked with were what saved the day. Whether that was having a co-worker who really likes soldering electronics together for fun and making crazy things with Raspberry Pis, so they could save the day with soldering; or having a co-worker who had been a submariner, and so nothing we were dealing with would ever freak him out, because it was not actually that important; or having co-workers who had a lot of experience debugging Node applications that went haywire. Having that diversity of experience, that diversity of knowledge, was what made our team stronger. But having it isn't sufficient. I know a lot of teams that are like, yes, we have diversity, check. But you must value it and you must cultivate it, which means that in a crazy meltdown situation, maybe you should listen to that person who's only been on your team for six months. Maybe they have an experience from their internship that will exactly help the situation you're in. And also, if you're that person whose internship experience exactly fits the situation, you might be wrong, and you perhaps should listen to the folks who know this system inside and out. But everyone should be listening to everybody, and be open to the idea that everyone's experiences are what's going to get you through the current crisis.

So I'm going to say thank you. Here's all my contact info. Again, I'm the thagomizer on Twitter, I'm on GitHub as thagomizer, and I blog at thagomizer.com about a mix of here's how I do this cool thing with technology and here's why tech culture sucks. I work on Google Cloud Platform as a developer advocate. My primary focuses are DevOps and Ruby. So if you want to run a Ruby site, or figure out a way to do that DevOps-y thing, or you just want to do containers, I can probably help you out.
And I have stickers and tiny plastic dinosaurs, because I always have stickers and tiny plastic dinosaurs. And because this talk is right before lunch and we're still a little bit early, I'll take some questions. But I invite anyone who wants to tell me their "this one time the CEO visited the data center" story or their "that day I broke the internet" story to join me at my table at lunch, because this is one of my favorite parts of conferences: hearing about all the ways that people have messed up. So thank you. So, questions. Anyone have questions?

So the question was, how do you convince a team that is being, we'll go with recalcitrant, to adopt better engineering practices like gradual backoff, or perhaps the ability to isolate external dependencies. I can say with utter and complete confidence that I have worked on a team that did not do several of these things. And a great piece of advice Ryan Davis gave me is, you get to have one complete, utter tantrum at work a year, but you should pick it wisely and you should schedule it ahead of time. Which is a fantastic way of saying pick your battles. And so I've picked my battles on some of these things. I lost the fight on having a graceful fallback from an external dependency, but I did win the fight on gradual backoff. And the best way I've found is to tell horror stories and to also just be a pain. Not to be impolite about it, but just, no, we need to do this. No, we need to do this. I really don't wanna take the pager if you are not gonna do this. But you have to be fairly senior to be able to get away with that. Also, time. Sometimes you let them feel their pain. In one of the situations I didn't tell, we were working on a site and someone else had taken over maintenance of it, and we sat there and watched it melt down for a couple of hours, while they didn't know how to deal with stuff because they hadn't let us do the right thing, before we finally stepped in and saved the day, because we wanted them to feel their pain. So sometimes you just have to let them feel their pain. This is what betas are for. So yeah, other questions? Awesome, I didn't figure there would be many, because stories. So thank you all, go have lunch.