 What is up, party people? Thanks for coming to my talk here. I guess you didn't have a choice. It's called Can't Touch This: when things go wrong with an appliance. So, there we go. Okay, so this is me. I consider myself an okay developer, an okay systems guy, a jack of all trades. I'm working on a beard, like PJ said. I work at GitHub on the enterprise support team. It was weird. PJ asked me to come and speak, and I was like, oh God, I'm not really a developer. I'm not gonna have a bunch of cool Ruby stuff to talk about. So forgive me, but I'm gonna talk about something that's interesting to me, and that's supporting something that we can't touch. The other thing too was, when I was telling PJ I was gonna name this talk Can't Touch This, he's like, oh dude, gotta get Hammer pants and all this stuff. And I was like, all right. And I looked on eBay and they're really expensive and they don't make them anymore, and they have like Zubaz and those are lame. But I did find some sweet pictures of Hammer pants, like this one. That would be me. It doesn't look very good. Hold on. All right, there we go. So that's Vanilla Ice on the other side with the America pair. I thought those were way cooler than Hammer's pants. But you can also see what PJ would look like. I think he has a little beard under there. I just pasted him on top of my head. So you can also tell I should add Photoshop Grandmaster to my skill list. All right, so yeah, let's talk about how we make GitHub Enterprise customers happy when shit goes south. I hope you like it. It's not really about that, though. It's really about how we get around a big problem, and that is that we can't touch this. This is the appliance, and by the appliance I mean an extremely complicated appliance, with users that come to us asking for help and expecting to talk to experts in anything and everything. But there is no such person. Oh no. It's just catching up.
And we've tried to find these people. They don't exist anywhere at GitHub. So I work in support. What is that? It's a little different than what you're used to, I would imagine. It's not ancient aliens. And it's not like, I was playing Diablo 3 and some guy hacked my account and I couldn't get in, and I was like, oh Blizzard, help, DemonSlayerPrincessFairy49, help me out and recover my account. We don't really do that stuff. It's actually really cool support. It's kind of like this. I don't know if you watch True Detective, but it's like that, right? We wear suits, we do awesome stuff. We usually have this big problem, and we have to go through all this forensic evidence and piece it together. And I'm the guy, I drew a beard on there with my Photoshop skills. It's pretty awesome. I had to have the sunglasses too. I made this GIF myself. So a little background about GitHub Enterprise, right? You can't touch it. It's an appliance. It's an OVA that's imported into VMware. We also support VirtualBox, unfortunately. An OVA is really an open virtualization appliance. It's a tarball of an OVF, which is the open virtualization format. It's just a VM image that has some metadata that tells the hypervisor how to run it. It gets upgraded, so we do releases. You upload this package file. It's really just a bunch of .debs, and then Chef goes and installs them all, runs database migrations, that kind of thing, to get it updated. And it contains all of GitHub: gists, pages, and then some super secret special sauce for enterprise customers. We'll talk about that in a little bit. All right, the whole thing, though, runs on one VM, and we say, oh, you can do it with two cores and eight gigs of RAM, right? And if you think about it, that's a lot of stuff on one VM that is maybe a mid-tier instance you'd get on AWS. So obviously it's not always perfect. But it works surprisingly well.
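The OVA packaging mentioned above is simple enough to demonstrate by hand. This is a toy sketch, not GitHub Enterprise's actual image: the file names are made up, but the structure, a plain tar archive wrapping an OVF descriptor (XML metadata) plus disk images, is exactly what was just described.

```shell
# Illustrative only: build a toy OVA to show its layout. Names are hypothetical.
mkdir -p /tmp/ova-demo && cd /tmp/ova-demo
printf '<?xml version="1.0"?>\n<Envelope/>\n' > appliance.ovf
printf 'not-a-real-disk' > appliance-disk1.vmdk
tar -cf appliance.ova appliance.ovf appliance-disk1.vmdk
# Listing the archive shows what a hypervisor's importer reads:
tar -tf appliance.ova
```

Listing the archive prints the descriptor and disk entries, which is all an "appliance" really is at the packaging level.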
For most customers, if you have 50 developers, it's kind of amazing how this one little VM running all this different stuff works really well. But it is software. And some guy named Jason, I'm sure, said this. There's a guy on our team named Jason, and I was like, ah, I'm sure he said this at one point, so I'll just credit him with it. I mean, it breaks, and people do stuff with it that you don't intend. We say, oh, you can run it on this one VM with eight gigs of RAM, but it's not gonna scale for everybody. Like, 50 developers will have a CI server, a Jenkins server, and it can poll and it's fine and it's probably okay and they're not gonna hit a lot of issues. But if you're like GE and you've got 20,000 developers or something, yeah, you're not gonna have a CI server, you're gonna have CI infrastructure, and that means you're gonna have polling and clones and checkouts and stuff all the time. Tons of IO, tons of users. In fact, at that scale, your CI is probably generating more traffic than your users are, because of all the crazy stuff it's doing. Beyond that, people do crazy stuff. I mean, people set up weird cron jobs on their machines to go and do all this polling. GitHub.com has all kinds of rate limiting that happens, but that's not included here, right? Because we don't want to rate limit people that are paying to have our stuff running on their own infrastructure and have control over it. So things happen, stuff breaks. And we can't use New Relic, we can't use Honeybadger. All the stuff that you use to maintain your app and diagnose issues, we don't get any of that, because this stuff is running behind a firewall and they don't want us to touch it. Which brings us back to this guy. The other thing he does, besides not touching stuff, is he's too legit to quit, which is always good. And that's what we needed to be, right?
We decided we needed to engineer something around this, to at least get some context and try to actually help these people. And just like anything, we took the most simple, straightforward thing we could do. One thing, though, when I say we can't touch this: customers can SSH into their appliances. They don't get root access, but they can SSH as an admin user, which is unprivileged but lets them do basic stuff. They can use free, df, top, that kind of thing. There are some scripts they can run to do all kinds of maintenance. They can mess around with the repos using git, so they can run git gc and git fsck and that kind of stuff. But they can't mount some arbitrary volume on the appliance. They can do other stuff, though. For instance, if they need to reindex Elasticsearch, they can do that from the command line. They can install root CA certs and that kind of stuff. So yeah, it's important for them. So the first thing we decided we need, whenever somebody comes to us with their hair on fire going, oh my God, help, is some kind of context. So what we do is we have them generate a diagnostics file. The diagnostics file is just a text file. They click a button in the UI and it's like, here's this file. Or they can go over the command line and run a command and get it. It contains just about everything you would need to get context on this appliance. So things like the release they're running, SHAs of what packages were installed, disk usage, connectivity tests. GitHub Enterprise lets you connect to all these different types of authentication. LDAP is huge. If anybody works in an enterprise, you probably sit on LDAP or CAS or something like that. You can use all that, and all of the configuration for that.
It does show up here, and we also check to make sure there's connectivity. It's super important. A lot of the problems we end up seeing are directly related to this kind of stuff. You also get some stuff like your ps output, so you can actually see what's running, when things started, all that kind of stuff. The big thing with this file is that it's the first thing we need; it's very hard to do anything without having it, right? It has to be reliable. It's really simple, right? We wanted to do something simple. It's all just bash. It doesn't require anything to actually be running on the VM. If something isn't running, it fails gracefully. We designed for that. And if the GitHub app isn't working, we'll know that just from looking at this, which is really helpful. There are some things, though, where it'll go and poll Resque and that kind of stuff. So if Resque isn't working, obviously, it'll just say that. Here's kind of an example of what some of it looks like. It's kind of hard to read. But here we're just looking at some of the disk information, so we can see disk usage, inode usage, that kind of stuff. The file system setup is static, so if it's not looking like that, we know somebody's messed around, all this kind of stuff. Again, here's some of the connectivity tests. So we're checking DNS, SMTP, all that kind of stuff. What's interesting too is, when you have your own web app, you kind of control all these other parts. Here, we can't plan on whoever they're using to send email being reliable, right? We have to assume that it's probably gonna suck. And this checks all that. One other interesting bit is that the git that's used is a fork of git, GitHub's git, and it has some special sauce.
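The fail-gracefully idea is easy to sketch in bash. To be clear, this is not the actual diagnostics script, just the pattern it was described as following: each probe is wrapped so a dead service produces a note in the report instead of aborting the whole run.

```shell
#!/usr/bin/env bash
# Illustrative only: wrap each probe so failures are recorded, not fatal.
run_check() {
  local label="$1"; shift
  echo "=== ${label} ==="
  local out
  if out="$("$@" 2>&1)"; then
    echo "${out}"
  else
    echo "(check failed: '$*' exited non-zero)"
  fi
}

run_check "Disk usage" df -h
run_check "Processes"  ps aux
run_check "Resque"     some-nonexistent-resque-probe   # hypothetical; fails gracefully
```

The last probe is deliberately a command that doesn't exist: the report still completes, and "check failed" in the output is itself a diagnostic signal.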
So it actually sends out metrics to a collector called Gitmon, and Gitmon will stuff everything in here too. So we can actually go through and see what operations are happening. It also helps you determine which repositories are problem children. Usually it's all related to IO. Here's the other end of it. It's kind of long. Like I was saying, a lot of problems are exposed just by looking at this one file that's like three pages long. For instance, we have people that say, okay, yeah, I'm getting random timeouts in our app. And if you look at this and you see that they're consuming all the memory, they're using the bare minimum eight gigabytes and it's all consumed, and all or a lot of their swap is consumed, that's generally the problem. Same thing with LDAP, like I mentioned: if we can't ping your LDAP server, well, you're not gonna be able to log in. The big brother to this is called the support bundle. Looking at the diagnostics file is fine and dandy for determining that there is a problem, but if you have something random, like, I'm getting a 500 adding an SSH key, right? This file is probably not gonna help you too much. It doesn't sound like a resource problem. It doesn't sound like something's not running. There's just a problem somewhere. So obviously we gotta look at the logs, and they're kind of important. The bundle gives you those, right? It basically gives you all of our logs. It also includes a bunch of other logs that you wanna look at, like your Rails log, Unicorn logs, all that stuff. It also includes stuff like the user histories. User histories are not necessarily a log, I would consider, and they're kind of easy to fake, but they're helpful anyway. We also include dna.json, since Chef is behind the whole thing, and the Chef logs. The one cool thing that does get thrown in here is an exceptions log.
So there's a library called failbot that we use, and all exceptions get centrally logged to this log. So across all the Ruby applications, everything gets funneled in here as JSON, and then we get it. It sucks to grep through this huge file sometimes, because grepping through a huge bunch of unformatted JSON is not fun, but we tried to fix that later on. The other interesting log is the authentication log. There's the auth log, so you can see all the SSH connections that come in, but this one logs all the interaction for all the different authentication adapters there are. So if you're doing LDAP, right, there's a whole bunch of different queries that go back and forth. We go and we poll the LDAP server and we say, hey, does this user exist? Yeah, okay, here's the credentials, logged in, okay, what groups are they in? All that kind of stuff. All that back and forth gets logged, and it's really helpful for us. I mean, you need that stuff to diagnose any issues. The other interesting log, which isn't necessarily included in here but is included with repositories, so we can get it as well, is called the git audit log. So git, besides sending off all this metric data about the repositories it's interacting with, is also logging every single action that happens. Anytime it does anything, it logs it, and it logs a whole bunch of data about it. So it looks like this. It's JSON again. You get everything from the repo name, to who did it, to the PID, to everything. This isn't something that normally needs to be looked at a ton, but sometimes you get situations where it's unclear: some repo will get weird, or a commit will look different, or something. This is the absolute truth of what happened, in chronological order, to the repo, and it's super helpful. So lastly, the other part that gets thrown in here: there's collectd running, so it collects all kinds of resource metrics.
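To give a feel for what grepping through that kind of log looks like, here's a tiny mock of a newline-delimited JSON exceptions log. The field names and values are invented, not the real failbot schema; the point is just that plain shell tools can still slice it once you know the shape.

```shell
# Hypothetical failbot-style log: one JSON object per line.
cat > /tmp/exceptions.log <<'EOF'
{"class":"Timeout::Error","message":"LDAP search timed out","app":"github"}
{"class":"ActiveRecord::RecordNotFound","message":"no such user","app":"github"}
{"class":"Timeout::Error","message":"smtp connect timed out","app":"mail"}
EOF

# Count occurrences of one exception class across the whole log:
grep -c '"class":"Timeout::Error"' /tmp/exceptions.log   # prints 2
```

This works, but it's exactly the tedium that pushing the log into a search index (as described later) gets rid of.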
This isn't necessarily for the customers. It's not like there's a bunch of pretty PNGs that get thrown in here. But we get the raw data out of it. Customers can send collectd off to their own collectd server and monitor it that way, but I don't think most of them do. You can use SNMP and that kind of thing, and I think most tend to use that. But it's helpful for us, and we'll get to how it's used later. It's generated just like a diagnostics file: click a button, or SSH in and run a command. It's just a huge tarball with all this stuff in it, but it can totally suck. Like, 50 developers is fine. You're gonna get like a 50 megabyte tarball, and that's easy enough to work with. But with those 8,000 or 20,000 developers, this is gonna be gigabytes, right? And then the problem is, one, generating it is gonna create a lot of IO, and usually if you're generating this file there's already a problem. So that sucks. And the other part is, how do you move this four gigabyte file around and get it to us? You can't attach it to our ticketing system, because it's Zendesk and there's a file size limit. You can't do that. You don't wanna email this thing around, because it's A, not gonna work, and B, nobody wants the logs for the thing holding all of your source code to just be emailed around. It's kind of a problem. So again, what would Hammer do? Well, he's too legit to quit. We know that. You'd make a thing called ESP Tools. ESP Tools is Enterprise Support Bundle Tools. Not a clever name, but what it is, is just a big thing for uploading these bundles, parsing them, and then letting us download them. Customers go here, they upload it. It looks nice and provides them a nice, secure place to do it. And then it gives us this nice dashboard. I don't like dashboards, but this one's all right. You get a bunch of information right off the bat, right? It extracts it, parses it, you get this.
One of the cool things you'll see is there's a diagnostics tab and there's an exceptions tab, which is great, but I'm getting ahead of myself here. I'll get to that in a minute. But you also get a tab for the graphs and files, so you can actually go and browse the whole thing in your browser. One of the things about that that's nice is one of our tendencies: you want to have a URL to anything, basically. If there's something you need to share with somebody, you should be able to reference it via URL. This allows us to do that. If there's a file and there's something in there you want somebody to look at, you can just send the URL, and they'll be able to look at it without having to download this huge, monstrous tarball. It can totally suck, though, because if you're looking at a 200 megabyte Nginx exceptions log in a browser tab, not so good. So one of the things too is it scans everything and looks for things that are obviously wrong. Back here, there's these little red triangle things. That's stuff that we know is probably a problem, right? These guys have a load average of 2.6. They have two CPUs. That's not great. You usually want to keep that at or below the number of CPUs you have. They have the bare minimum for memory, and we can see they're consuming a lot of swap. There's problems, and it's easy to see. The other thing is Elasticsearch. So I mentioned that exceptions tab, and that exceptions log that was all JSON. That all gets sucked into Elasticsearch, and you can query it from there. It's a lot better than trying to grep through it. It also, like I said, allows you to search for something, find an exception, find a line in an exception, and send a URL to somebody who might know something about it, and they can take a peek without having to do anything except open their browser. It also takes the collectd data and makes pretty graphs. That's about it.
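That load-versus-CPUs red flag is mechanical enough to sketch. This is not ESP Tools code, just the shape of the check, and it assumes a Linux-style /proc/loadavg and coreutils nproc.

```shell
# Illustrative check: warn when the 1-minute load average exceeds the CPU count.
cpus="$(nproc)"
load1="$(cut -d' ' -f1 /proc/loadavg)"
awk -v load="$load1" -v cpus="$cpus" 'BEGIN {
  if (load > cpus)
    printf "WARN: load %.2f exceeds %d CPUs\n", load, cpus
  else
    printf "OK: load %.2f within %d CPUs\n", load, cpus
}'
```

The 2.6-on-2-CPUs case from the slide would trip the WARN branch here; the same pattern extends to swap and inode thresholds.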
Perhaps the best thing it does, though, is allow for remote extraction. It's fine to look at all this stuff in the UI, but sometimes you do need to download it and grep through stuff. Grepping and whatever is still gonna be better in a lot of cases than having to look through it in a web UI. And a lot of us tend to work in places that may not have great internet, so that means Starbucks, or if you have some farm in Australia like one of the guys does, you'll have like one megabit internet. This allows you to SSH somewhere where the bundle has already been extracted, actually go through it, and do whatever you want with it from the command line. It's super helpful, and it's probably the best feature, the thing that we end up using a lot, and it's super simple and easy. So finally, the other part is it gets thrown into Hubot. Hubot's just this chat bot that does a whole bunch of stuff. He'll deploy stuff, he'll go and build releases, that kind of stuff. He'll do image searches and then waffle-bomb when the image search returns something not safe for work and you gotta push it up into the scrollback. So he does a lot of stuff, and he also integrates into a lot of this. He'll tell us when somebody's uploaded something and then when it's ready and extracted, and that's all you need. You can just copy that out and paste it, and then you'll be able to SSH right in and play around with it. He also goes right into tickets. We do everything through tickets. He goes in and tells you somebody uploaded it, puts it in there as a private comment, with how to download it. Again, super helpful, because then if somebody like a salesperson has to go and reference this ticket, they can actually go and see it and look at the bundle themselves. The other thing too is he'll tell us stuff about tickets. So he'll go and say, hey, somebody opened a ticket that they consider urgent priority.
Somebody go look at this. And then we'll reply and put it into a pending state, which means we're waiting on somebody to tell us something back. Well, he'll go and say when they reply to that, too. It's super helpful. But in the end, none of it's rocket science, right? It's all really simple, not that interesting stuff. It's basically just us trying to automate away all this tediousness. But the truth of the matter is, you still need to be there sometimes, and you still need to be able to run things on the VM. All this awesome stuff doesn't actually get you all the way there. For instance, there was a customer who had this repo that was getting corrupted, and we were like, okay, yeah, let me help them and get it uncorrupted, and it's fixed, fine and dandy. And a week later, the same repo was corrupted again, and it was like, all right, that's kind of weird. Pretty sure there are no bugs we know of that are doing this. And then we fixed it, and then it happened the next day, and then it just kept happening. And it got to the point where they would interact with it at all and it would just corrupt. And it's like, all right, this sounds like silent data corruption. This sucks. And that's super hard to prove, right? They have a SAN connected to it, and a bunch of other VMs using it that seem fine. The way we went about testing it, to try to prove the theory that there was something bad on their SAN, was: let's make a giant file and just copy it around on your VM. And of course, we had to tell him and walk him through how to do that. He made the file, checked the MD5 sum of the file, then moved it and checked the MD5 sum again, and it changed. So obviously the file had changed. And it happened right away, so we got lucky, but that kind of stuff does need to happen. As far as us actually SSHing in, getting on a VPN and SSHing in and all that kind of stuff.
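The corruption probe we walked the customer through boils down to a few commands. The file size and paths here are illustrative; in the real case the copy went across the suspect SAN-backed volume.

```shell
# Write a blob of random data, checksum it, copy it across the suspect
# volume, and compare. On healthy storage the sums always match.
dd if=/dev/urandom of=/tmp/probe.bin bs=1M count=8 2>/dev/null
before="$(md5sum /tmp/probe.bin | awk '{print $1}')"
cp /tmp/probe.bin /tmp/probe-copy.bin    # in the real test, copy onto the SAN
after="$(md5sum /tmp/probe-copy.bin | awk '{print $1}')"
if [ "$before" = "$after" ]; then
  echo "checksums match"
else
  echo "CORRUPTION: $before != $after"
fi
```

In the customer's case the second sum differed on the first try, which was the lucky break that proved the storage layer was at fault.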
It has happened maybe two or three times, but I honestly can't remember the last time. So that's kind of cool; we haven't actually needed to do that too often. But there's one more part, right? I didn't even get to the most important part, and this slide, right? It talks about how we have this extremely complicated appliance, with users who expect to talk to experts and just want anything and everything, and we can't touch it. And those people don't exist. It's certainly not me. I consider myself someone who knows a little bit about a lot of stuff, but not an expert in any of it. And even back to this slide. In True Detective, HBO is totally like, oh yeah, we gotta have two of them so it's interesting and all this stuff, but they're detectives, and they have partners because neither of them can do it on their own. They need somebody else to lean on, and for expertise, and all this stuff. Obviously I'm still Woody, because he's way cooler, but they need that, right? The thing that I've found working in support, and I think this is true of anybody in technology, is that if you try to be this lone wolf guy who sits in the corner and tries to just bang everything out and be awesome, you're gonna fail at it. I've seen a couple different kinds of this person. There are the people who are too afraid to go ask other people for help or to collaborate on something, and there are the people that make people that way. The people who are judgmental and tend to say, why are you asking me this stupid problem and question? You should know this. Of these two kinds of people, the second kind is the most toxic person I've ever worked with, and luckily I haven't had to work with a lot of them, or when I have, they've been dealt with.
And that's the one thing that I love so much about working in support and with the team I get to work with: we aren't like that. And that's the same thing I found at Engine Yard, which was one of the best places I worked as well. So anyway, that's my talk. I hope you enjoyed it. You can find me on Twitter. Yeah, I thought this GIF was pretty sweet. And you can email me as well, but I don't check it. So thank you. Thank you.