Connor Shea from GitLab. I'm a community advocate and front-end engineer, and today I want to talk to you about handling a major outage transparently.

So if you don't know what GitLab is, essentially it's a platform for software development and DevOps, and pretty much everything related to software. We cover the full cycle, from having an idea, to making an issue, to actually developing it, testing it, deploying it, and back around. So we do source code management, issue tracking, code review, continuous integration, deployment, et cetera. Just for a show of hands, does anybody here use GitLab? Awesome. And how many of you have heard of GitLab? How many of you heard of GitLab because of this incident? OK, not as many as I was expecting. That's good, I think. Not that. A lot of people get those confused.

So, some background, which I think is important for understanding why and how we did this the way we did. GitLab is remote-only; we have team members in 36 countries. Almost all our products are either open source or source-available, meaning you can go and look at the code, and learn from it, though the source-available parts are proprietary, so you can't actually reuse them. We use Rails and Go primarily. And transparency is a core value at GitLab. Our handbook is open; anybody can go and look at it. Salaries for most positions are available on the page, based on location and experience level; we have a calculator and everything. We also keep all our Chef cookbooks out in the open, and we keep a monitoring dashboard out in the open, which isn't as comprehensive as our internal one, but is pretty comprehensive. It's monitoring.gitlab.com, I think, or .net, I don't know.

So on January 31st, we had an incident, or accident, where an engineer accidentally deleted the production database for the live GitLab.com website, which was fun. We ended up losing six hours of data at the database level. No Git repositories were affected, because those are stored separately, but issues, comments, merge requests, et cetera, created in that window were unfortunately lost.

Handling it: we immediately tweeted out a link to a live Google Doc the infrastructure team was making, which was used afterwards for the retrospective, or postmortem. You can imagine the potential stress this would cause: that tweet has almost 3,000 likes, so there were a lot of people watching that Google Doc. We created the Google Doc to track progress, and Sid, our CEO, said we should share it with the community so they could see our progress, since the site was down and people wanted to know what was up. It ended up at the top of Hacker News within about 15 minutes, which was very, very fast, a lot faster than I was expecting. So there were a lot of people watching this happen live, so much so that we actually did a live stream. It wasn't a Google Hangout, we used Zoom, but it was essentially a Google Hangout with a bunch of the employees who were working on the problem, talking, along with our VP of Product, Job. And because we have a remote team in 36 countries, we were able to run this pretty much throughout the night. The incident started about 6 PM Colorado time and went until, I think, noon the next day. The biggest problem with restoring the backup was that we use Azure, and the backup was stored, I believe, in US West, while the actual site is in US East.
So we had to transfer it from one region to the other, which takes a lot longer than transferring it within a single data center.

We ended up number two on YouTube's trending section for a while. At one point we actually had more viewers than the White House press briefing, so that was awesome.

So, what if you want to do this yourself? People are laughing, but surprisingly, a lot of people said they would do this at their companies after we did it.

Protect the identities of anyone involved. We used the codename "Team Member 1" for the person who deleted the database. We didn't tell anyone publicly who it was, because we didn't want a lynch mob coming after them; it wasn't really their fault, it was the fault of our processes and the fact that this was able to happen at all.

Allow engineers to opt out of sharing their faces. I don't think anyone actually did, but a lot of people would not be comfortable live streaming to, like, 5,000 people while doing their job.

Only share your screen if you're 110% confident it's safe. If there's any chance you're going to have database information or passwords or anything like that on the screen, don't share it. Generally, we shared a screen showing a terminal connected to the Azure server, which displayed, essentially, a progress bar of the backup being restored, and I think that was pretty much all we showed. Be careful about showing webcam footage if it can see your keyboard, because people could figure out your password from that. Just make sure there's no way, like, if the NSA was watching, that they could figure it out. Although maybe they could, because they're the NSA.

Don't distract the engineers working on the problem. The reason we could live stream all night was that for most of it we were just restoring the backup; we weren't actively doing anything, so the team was able to focus on the live stream. But when we did have to actively intervene, run commands, and do things to get the service back up, we would have the VP of Product, Job, or someone else on the call, handle the stream from there.

If you're HIPAA compliant or PCI compliant, I have no idea if this is legal, so maybe check before doing it. I'm not really sure how much of this you can actually do if you have to comply with regulations like that, but it's worth considering.

Work in shifts. Obviously, don't have any one person working for more than a few hours at a time, because then you could have someone accidentally delete the database a second time.

Audit your backups and procedures, and make sure this can't happen again. That's pretty much what we learned. We needed someone who owns the backups, because at the time, we didn't have that. We do now.

Communicating with users in really bad situations is really, really beneficial to them, and it leads to better outcomes. People were pretty much praising us, not for deleting everything, but for sharing the whole process with the community. We didn't want to keep anyone in the dark on this, and users really value transparency. We got a lot of really positive feedback from a really bad situation, and I don't think we would have gotten that if we had done it the usual way, without sharing things with people.

Why do this at all? It's the right thing to do. People want to know what's happening with their data when their data is impacted.
And whether you want to do the full live stream or just keep a live postmortem depends on the company. A lot of companies won't be able to do this because of security compliance, rules inside the company, et cetera. Users will be a lot more forgiving if you're transparent about things like this. I think it does help that we're a product for developers, so most of our users are developers and understand that this wasn't necessarily all our fault. And pushing issues under the rug will hurt you in the long term. If you hide things from users, eventually they'll figure it out. They'll figure out that their data has been lost, or they'll figure out that something's up.

So, the results. We got an outpouring of support on Twitter, about 3,000 tweets, I would say 95% of them positive. We had 5,000 viewers on the YouTube live stream at the peak, which was in the morning, Colorado time. We hit the top of Hacker News and the top of a couple of programming-related subreddits, so there was a lot of exposure in that regard.

We did not fire the engineer involved in this mistake. For some reason, a lot of people asked about this as though we should have. We didn't, because it wasn't his fault. He actually got a promotion last month, I think. So if you want to get promoted, just delete the live production database. There's a quote from the CEO of IBM about an incident where the company lost $600,000, and someone asked him if he was going to fire the person who had caused the problem. This is what he said: "I just spent $600,000 training him. Why would I want somebody else to hire his experience?"

So we got a lot of notable support online and from other companies. Codefresh sent us a box of cookies, or multiple boxes. The problem with that is we're fully remote, and they were sent to our San Francisco office, which has about two people in it. So two people got to eat all those cookies. And then we made this shirt, which is our logo with a fuse. Google sent us some money to do something for the team, so we asked them to make shirts like this, and everyone on the infrastructure team got one. "Je suis" is French, I think, for "I am," so: I am Team Member 1. This was sent to us on Twitter and I really liked it. We also got HugOps from competitors like CircleCI, and also from our, I guess, partner Axosoft, who make GitKraken, which is a Git client. And then just a bunch of outpouring of support on Twitter from random people. Yeah, more support. As you can see, people were pretty happy with the transparency.

Like I said, I think it is somewhat unique to GitLab that we mostly have developers as customers. If you were Facebook, people might not be as forgiving, because they don't necessarily understand everything that goes into this kind of thing. The internet also made fun of us. This made it to the top of the ProgrammerHumor subreddit, I think. If you can't read it, it says: "Nurse: Sir, you've been in a coma since January 31, 2017. Me: Oh boy, I can't wait to see my 300 gigabytes of live production data."

We have a postmortem at that URL if you want to learn about the more technical aspects of this; this talk is pretty much just about the transparency. We have a bunch of issues open on our issue tracker for the infrastructure team about this. A lot of them have already been resolved, things that would prevent something like this from happening in the future, essentially. If you want to go to that URL, you can.
And that's our project for our infrastructure. We have an issue tracker there with a couple hundred issues. Anybody can go look at it, anybody can read the issues, the stuff we're doing, the stuff we've done. And I just think that's really cool, because most people don't get to share internal stuff with everyone.

As for what we've actually changed: we now use WAL-E for streaming database backups. Previously, I believe we made a backup every 24 hours, which is why we lost the six hours of production data. The reason we could recover anything at all was that someone had taken a snapshot of the data before they started working, which let us recover a lot more data than we otherwise would have. Oh wait, I skipped one, sorry. We also started working on automated testing of our backups, and we improved documentation on a lot of the procedures relevant to backups and things like that. (A rough sketch of what that automated testing could look like follows below.) So then, Q&A.
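To make those backup changes concrete, here is a minimal sketch of an automated backup-restore test, assuming a WAL-E setup like the one described above. WAL-E's `backup-fetch <dir> LATEST` is its standard restore command, but the paths, scratch port, database name, and sanity query below are illustrative assumptions, not GitLab's actual configuration. (On the production side, "streaming" backups with WAL-E typically mean pointing PostgreSQL's `archive_command` at `wal-e wal-push` and running a periodic `wal-e backup-push`, which is what closes the 24-hour gap.)

```python
#!/usr/bin/env python3
# Sketch of an automated backup-restore test: fetch the latest base backup,
# bring it up as a throwaway PostgreSQL instance, and check that it actually
# contains recent data. All paths, ports, and queries are hypothetical.
import subprocess
import sys

SCRATCH_DIR = "/var/lib/postgresql/restore-test"  # hypothetical scratch cluster
WALE_ENVDIR = "/etc/wal-e.d/env"                  # hypothetical WAL-E credentials
SCRATCH_PORT = "6432"                             # anything that isn't production
DATABASE = "gitlabhq_production"                  # assumed database name

def run(*cmd):
    """Run a command, echoing it, and fail the test on a non-zero exit."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main():
    # 1. Pull the most recent base backup. `backup-fetch <dir> LATEST` is
    #    standard WAL-E usage; `envdir` supplies the storage credentials.
    run("envdir", WALE_ENVDIR, "wal-e", "backup-fetch", SCRATCH_DIR, "LATEST")

    # 2. Start a throwaway PostgreSQL instance on the restored data directory.
    #    (A real test would also replay archived WAL up to a known point.)
    run("pg_ctl", "-D", SCRATCH_DIR, "-o", "-p " + SCRATCH_PORT, "-w", "start")
    try:
        # 3. Sanity query: a backup only counts if recent rows are inside it.
        out = subprocess.check_output(
            ["psql", "-p", SCRATCH_PORT, "-d", DATABASE, "-tAc",
             "SELECT max(created_at) FROM projects;"])
        print("newest project in restored backup:", out.decode().strip())
        # A real check would alert if this timestamp were older than the
        # expected backup cadence, rather than just printing it.
    finally:
        run("pg_ctl", "-D", SCRATCH_DIR, "-w", "stop")

if __name__ == "__main__":
    try:
        main()
    except subprocess.CalledProcessError as err:
        sys.exit("backup restore test failed: %s" % err)
```

The design point is the one the incident made obvious: a backup that has never been restored and inspected is not really a backup, so the restore-and-query loop is the part worth automating.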
So, the question is: did we see any user growth or attrition following the outage? I'm not actually sure of the exact numbers. I don't think we were impacted significantly negatively, but I don't think we were impacted positively either. I could be wrong about that; I don't personally track that data, so I'm not sure. That might be covered in the postmortem. Again, I'm not sure.

The question is: how was this propagated? How did everybody find out that this had happened? It was primarily a Slack channel. In the infrastructure channel, somebody said, am I allowed to curse? I don't know. "Oh, [blank]." And then they explained what happened, and everyone else pretty much repeated what they said. I came in about 30 minutes late to that conversation, and when I realized what had happened... yeah, that was fun. At that point it pretty much became an all-hands-on-deck type thing, and pretty much everyone on the infrastructure team joined in to help.

The question is: what was the business response? Do you mean, like, third parties we're trying to sell to, or inside the company? I'm not sure. I think we kind of expect this kind of thing, because, like I said, the CEO was the one who asked us to share this publicly. It's kind of par for the course with GitLab. I can't really imagine us doing it any other way, and even after this, I think we would still do it if something like this were to happen in the future. So I'd say pretty much everyone in the company expected something like this, if something like this were to happen.

The question is: how does WAL-E help with having to restore the backup from US West to US East? So, yeah. The main problem was that we had set up the backups in a different region from where the live production service was, and WAL-E, I believe, now just backs up into the same region that the main service is in. So it's mostly just a configuration change.

The question is: does backing up into the same region mean we're exposing ourselves to new types of issues? I'm personally not on the infrastructure team, so I don't think I can necessarily answer that, but I would think so, yes. I'm not sure if we have a secondary backup... I'm pretty sure we do, I just don't know if it's stored in the same region or in a separate one. I would imagine a separate one, but I'm not sure.

The question is: has this helped with hiring? I'm not sure. I imagine yes, but I'm not entirely sure about that. I can see it going both ways, where it could scare people off, but it could also make people want to join the company because of the transparency and everything. Yeah, I'm not sure.

I think that's it. Thanks.

Thank you, Connor.