I'm Jessie Toth. I'm known as Jessie++ on the internet, and I'm here to talk to you about easy rewrites with Ruby and Science. A little bit about me before I get started: like I said, my name is Jessie Toth, Jessie++ anywhere on the internet that matters. I do a lot of back-end Ruby work at GitHub on our giant Ruby on Rails app. I like things that cross over with database stuff, Git stuff, permissions stuff, all sorts of fun stuff back there. So, on to the rewrites. I have to admit that I said easy rewrites, and maybe I lied, because you may know that rewrites are never easy. I don't think I've ever seen an easy rewrite. In fact, a legitimate reaction to someone saying they want to do a rewrite is: a rewrite? What? Why do you want to do that? That sounds like a terrible idea, because a lot of them fail. A rewrite is really hard to do, first of all. They take a long time, a lot longer than you expect. Most of the ones I have seen never finish. You keep rewriting and rewriting and rewriting, and it's never good enough, or it's never quite the same as the old system, and then maybe you throw it away and start a new rewrite. You rewrite the rewrite, and you just keep doing this. So a rewrite is a pretty scary thing to start. And the rewrite we did at GitHub was extremely scary, because it was way bigger in scope than anything I've ever done in the past. But it was pretty successful, because of the tools we used. Here's what the rewrite was: we wanted to rewrite our permission system and create a more flexible system to grant and revoke access to repositories, forks, issues, pull requests, teams, organizations, basically anything that's controlled by permissions on github.com. And that's scary. That's pretty far-reaching. It affects just about every single page load of github.com and every single API request. So we were touching a lot of stuff.
But to understand why this was necessary, I want to give you a little bit of history of what the system was before we decided to rewrite it, and why a rewrite really seemed necessary. First, there was collaboration. When GitHub started, we let people collaborate with one another. There was a feature where you could add someone as a collaborator to your repository and give them access to your code, and you could use pull requests and other tools to collaborate back and forth and work on the code together. As GitHub grew, the original collaboration feature was not enough. It was basically just two or maybe three people working together. But people started to have teams, or they started to put their companies on GitHub, and they needed more effective ways to organize that sort of access and permissions. So then we added organizations. These were ways to group your teams together and give them access to the different repositories your organization controlled. But there were problems with this from the start. One of the biggest problems was that these two systems came in at different times, so they ended up having different ways of granting permissions. The old collaborators granted permissions one way. In fact, the table looked kind of like this: a super simple join table. It said this user has access to this repository, and that's it. But when we did organizations, we implemented it a slightly different way, with team members. If you look at this closely, you might scratch your head and notice that this is a three-way join table, which is pretty terrible. It joins a team, a user, and a repository. A user could be a member of a team; that's how you said you were on a team. But a repository could also be a member of a team; that's how you said this repository gives access to these team members. It wasn't the best schema, and it caused us a lot of problems.
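To make the difference between the two schemas concrete, here's a minimal sketch in plain Ruby. The table and column names are illustrative guesses, not GitHub's real schema; the point is that the three-way join table overloads one table with two different kinds of "membership", so a single permission check has to consult both shapes of data.

```ruby
# Old collaborator model: a simple two-way join table.
# collaborations(user_id, repository_id)
collaborations = [
  { user_id: 1, repository_id: 10 },
]

# Organization model: a three-way join table joining team, user, AND repository.
# A "user row" has repository_id nil; a "repository row" has user_id nil,
# so membership means two different things in the same table.
team_members = [
  { team_id: 5, user_id: 1,   repository_id: nil }, # user 1 is on team 5
  { team_id: 5, user_id: nil, repository_id: 20  }, # repo 20 is "on" team 5 too
]

# Checking pull access requires consulting both shapes of data.
def can_pull?(user_id, repo_id, collaborations, team_members)
  # Direct collaborator access.
  return true if collaborations.any? { |c| c[:user_id] == user_id && c[:repository_id] == repo_id }

  # Team access: find the user's teams, then see if any team "contains" the repo.
  teams = team_members.select { |m| m[:user_id] == user_id }.map { |m| m[:team_id] }
  team_members.any? { |m| teams.include?(m[:team_id]) && m[:repository_id] == repo_id }
end

can_pull?(1, 10, collaborations, team_members) # true, direct collaborator
can_pull?(1, 20, collaborations, team_members) # true, via the team
```

Even in this toy version, the nil-column trick makes the query logic easy to get wrong, which is exactly the kind of bug the talk describes later.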
The places where it started to cause us problems were the places where we needed to get lists of things. There were a lot of places that needed lists: lists of particular repositories, lists of pull requests you had access to, lists of teams even. And they all needed to access the permissions in slightly different ways, depending on the kind of data they wanted. So we had things like the repositories an organization controls; you need to get those one way. Versus, from a user's perspective, what pull requests can they access? Those might differ based on whether they have access to individual repositories as a collaborator, or through organizations. As time went on, we found there were a lot of bugs around edge cases and transitional states. We actually have a lot of transitional states on GitHub. You can be added to and removed from a team. You can have your access removed. You, as a user, can transform yourself into an organization. You can transfer your repository to another user. You can do all these crazy things, and there were lots of bugs and lots of craziness. So we started to see issues like this: people could see things on their dashboard that they didn't actually have access to. They'd get a list of pull requests that said, you can access this one, but they'd click on it and land on a 404: oh, sorry, when we got to this page, we finally discovered that you no longer have access. We kept seeing that sort of problem. In addition, we started to see performance degrade. As GitHub got more popular, there were more repositories to pull in, more issues, more pull requests, a lot of stuff to grab. And each of these places grabbed its lists in a different way, and they all started to have performance problems at different times.
So different people would come in and say, oh, there's a performance problem here, let me optimize it. And each person did this a little differently in each place. But they all ended up kind of like this: giant hunks of hand-optimized SQL. And this isn't very pretty, right? Each one was slightly different, because a slightly different person had optimized it, and it was grabbing slightly different data. So all these things compounded and already gave us a good reason to rewrite. But we had one more thing on top. Chris Wanstrath, defunkt, our CEO, said, you know what, organizations aren't even good enough yet. We want to make them better. But looking at the permission system we have, we can't possibly add anything to it; it's already so complicated. If we wanted to, say, change how permissions work, we can't do that. I'd like to do that. Let's find a way to do that. And so we said, OK, we have to do a rewrite for that. If you want that, you must let us rewrite. All the history I've given you so far actually happened before I even joined the company, so right now I'm telling you the story of two heroes: John Barnette and Rick Bradley. They started off at the beginning of the rewrite, replacing the system to see if they could make something better. They started with some pretty simple goals. They wanted something much simpler than what we had, and much more flexible, so it could be extended to whatever we might want to grant permissions on in the future, just any general permission. We wanted it to be fast, super fast. Some of these code paths already had performance problems, and GitHub was continuing to grow and seeing bigger and more complex use cases. So it needed to be fast now so that it would continue to be fast in the future. And we also wanted it to be easy to operate alongside the things we already had. GitHub is pretty conservative about operations.
We don't like to add new databases or new technology; we tend to stick with what we have. So we said, the old thing was in MySQL, let's write the new thing as a table in MySQL too, instead of maybe going for a graph database or something like that. They started off with an initial spike. They wanted to spike something out quickly and test how it would perform under production load as early as possible, so they didn't get too far into writing something and then realize it wasn't going to work. It turns out this was a pretty legitimate concern. So they wrote something they called Capabilities, and this was going to be the system we would use. John started writing it and saying, this is how it's going to be: you ask the capability, can this user do this thing? Can this object do this thing? But in order to test it, alongside the rewrite we also needed to do a refactor, because to test it with production data, we needed to kind of shim it into the areas that were already reading permissions, dark ship it a little bit, run it, and compare it against the old code. So while John was writing the new system, Rick was refactoring just a few key touch points so that we could drop the new system in and see what it would look like if we switched over. But he ran into a problem while he was doing that refactoring. He was finding places where there were some tests, and maybe there weren't as many tests as we wanted, but the real problem with the tests was that they weren't modeling production data. We'd been seeing scenarios that were really complicated, maybe from people who had been users since the beginning of GitHub and had accumulated all this data. And no matter how we tried, we couldn't reproduce those cases in our tests. So what he decided to do was run a little experiment.
He wanted to conditionally execute a path that he had tried to refactor and see: did the refactoring return the same thing the original code did? That's usually what your tests check, but with such complex data, we said we actually needed to test this in production to see if it really was doing the same thing. So that's what he did. He basically dark shipped this little refactor, and he used our instrumentation library to throw off some events any time it ran. He started off running it very rarely, maybe 1% of the time, comparing the results at the end, returning the original code's answer, but doing some timing and measurement around what happened on the refactored path. And this turned out to be a really useful pattern. He was able to see, oh, I didn't quite refactor this correctly, I forgot this little case, and he was able to fix it. He kept doing this, and it turned out to be really useful. So we pulled it into a library and called it Science, and we made it available to everyone at GitHub, because we said, this is really useful, actually. You should try science on everything. You should run experiments on all of your code and see if this works for you. So let me run through an example of what science looks like. We have a repository class in our models, and here's a question we ask repositories a lot: are you pullable by this user? Can they pull your code? To put a science experiment in there, you say, I want to make an experiment, and give it a little name string. Then you take the old code that used to be in pullable_by and put it into a new method. We came up with a convention of taking the same method name and throwing legacy on the end, so pullable_by_legacy gets that code. And you say, OK, for the science experiment, I want you to use this legacy code, but I also want you to try something new: try out this new code that I wrote. This is just an experiment.
We want to see if it's going to work. And you can add some useful things to it, like a context. We said, here's the repo we're trying, here's the user we're trying, so if the results don't match, we can use that context to go back, investigate, and see what was happening. Each experiment would publish its results, and we passed those to our instrumentation library, and we were able to gather some really neat things. We could see the total number of times we ran it: how many times did this get called? We grabbed timing data around it: how long did the old code path take? How long did the new code path take? We also threw a custom event whenever things mismatched: when the results did not match, it incremented a counter of how many went wrong, so you could keep track of how often you were mismatching. And then we did something super simple where we just dumped the payload into Redis so we could look at it later. We said, OK, if we have a mismatch, we want to go investigate that data, see what went wrong, and use that to go back and change the code. They used this process over and over again, and they were able to get a decent spike out. Now, it didn't quite work. It had some performance problems. So they stopped and threw it away, but they stepped back and took the lessons they'd learned from it to build a new system. This is the part of the story where I come in. I had just joined GitHub at this time; in fact, Nathan and I both joined GitHub at this time, and we were asked if we'd like to join this project. We sat down with Rick and John and talked through the lessons they'd learned from Capabilities, and said, OK, we want to build a new system. We'll give it a new name, too, so we don't confuse ourselves: we're calling it Abilities. We're going to take everything we did wrong there and build this new system. And we really want to actually put it in now.
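The experiment pattern described above can be sketched in a few dozen lines of plain Ruby. This is not the actual library GitHub used (later open sourced); it's a minimal, self-contained sketch of the idea, and names like `pullable_by` and the payload shape are illustrative. Run both code paths, always return the legacy answer, capture candidate exceptions, and publish a payload on mismatch.

```ruby
# Minimal sketch of the science pattern: compare a legacy code path (use)
# against a new one (try), always returning the legacy result.
class Experiment
  attr_reader :name, :mismatches

  def initialize(name)
    @name = name
    @context = {}
    @mismatches = []
  end

  def context(data)
    @context.merge!(data)
  end

  def use(&block)
    @control = block # the legacy code path
  end

  def try(&block)
    @candidate = block # the new, experimental code path
  end

  def run
    control_value = @control.call
    candidate_value =
      begin
        @candidate.call
      rescue StandardError => e
        e # a candidate exception must never escape to the user
      end

    # On mismatch, publish a payload for later investigation (the real
    # system pushed this into Redis and bumped graphite counters).
    if control_value != candidate_value
      @mismatches << { experiment: @name, context: @context,
                       control: control_value, candidate: candidate_value }
    end

    control_value # the caller always gets the legacy answer
  end
end

def science(name)
  experiment = Experiment.new(name)
  yield experiment
  experiment.run
end

# Usage, shaped like the pullable_by example from the talk:
result = science "pullable-by" do |e|
  e.context repo: 42, user: 7
  e.use { true }   # pullable_by_legacy: the old code path
  e.try { false }  # the new code path, deliberately wrong here to force a mismatch
end
```

The key design choice is that the experiment is invisible to callers: whatever the candidate does, `run` returns the control's value, so you can ship this on any production code path.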
I think we're ready. What we came up with was something super simple like this. Basically, you ask the system a question: can this user read this repository? And we generalized that quite a bit. You have a general actor: it could be a user, it could be a team, it could be an organization. And you have a subject: it's a repository in most cases, but sometimes it's a team or something else. You can ask questions about it. You can grant things: a subject grants an actor a specific action. And you can revoke those things. This was super simple; this is basically all we came up with. We had to add a little bit more: we had situations with users and teams and repositories, so if you grant a team access to a repository and a user access to a team, we wanted that to cascade so that the user gets access to the repository. But beyond that, that was the whole system. That was it. And we thought it was super simple, and that it would maybe break down some of those huge queries we were seeing. So we went through this, and we wrote the core of Abilities, the actual rewrite, in maybe a few months. There was a bit of iteration on it, figuring out what we wanted to do. We went off into the weeds a little: we tried to make it too general, and then we came back and said, no, we really need this for this specific case, let's not go too crazy. But the actual rewrite didn't take very long. What did take long was the next part, which was mapping it to our legacy data. Once it was written, we said, OK, we need a way to see if the data generated by this system is the same as the old system's. So we wrote some little migration scripts. In the beginning, it was just the GitHub org: run through the GitHub org, and for every user and team and repository in there, try to generate the permissions the new ability system should have, based on the data we have in the old system.
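The actor/action/subject model plus the one level of team cascading can be sketched like this. This is an in-memory illustration, not GitHub's implementation; in particular, modeling team membership as a `:member` action is an assumption made here just to show the cascade.

```ruby
# Sketch of the Abilities model: one table of (actor, action, subject) rows,
# with grant/revoke/can? and a single cascade through teams.
class Abilities
  def initialize
    @rows = [] # each row: { actor:, action:, subject: }
  end

  def grant(actor, action, subject)
    @rows << { actor: actor, action: action, subject: subject }
  end

  def revoke(actor, action, subject)
    @rows.delete_if { |r| r == { actor: actor, action: action, subject: subject } }
  end

  def can?(actor, action, subject)
    return true if direct?(actor, action, subject)

    # Cascade: if the actor is a member of a team, and that team can
    # perform the action on the subject, the actor inherits the access.
    teams = @rows.select { |r| r[:actor] == actor && r[:action] == :member }
                 .map { |r| r[:subject] }
    teams.any? { |team| direct?(team, action, subject) }
  end

  private

  def direct?(actor, action, subject)
    @rows.any? { |r| r[:actor] == actor && r[:action] == action && r[:subject] == subject }
  end
end

abilities = Abilities.new
abilities.grant(:team_a, :read, :repo_1)   # the repo grants the team an action
abilities.grant(:alice,  :member, :team_a) # alice joins the team
abilities.can?(:alice, :read, :repo_1)     # true, cascades through the team
```

Because there's only one row shape and one cascade rule, every "what can this actor do" query looks the same, which is what lets this replace the zoo of hand-optimized SQL described earlier.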
And when we ran through that, we saw a few problems and fixed them up. Then we opened it up and started running the migrators for everybody on github.com: let's generate the data for everyone and see, does this match our old system? We started off just generating the data, and that was good, but data would change between the time we generated it and the time we were measuring. So then we added places where we were dark shipping writes to Abilities. Any time you touched the old system, we said, write this to Abilities too. Write a new record, or if you're removing something, delete the record; just do both at the same time. We kept the dark reads scaled way down, maybe 10% of the time or something like that. We always wrote, but we didn't always read; we didn't want to put too much load on the system. But once we had that in, we wanted to science everything. Any place we read data out of the permission system, we added a science experiment. And we said, OK, keep reading the old system, but now start reading the new system too and tell us what the differences are. That's the point at which we could start looking at all the data we had generated and seeing what we had. It looked kind of like this. We built a dashboard to show the graphite data; all our instrumentation goes into graphite. We had graphs and stats on how many mismatches we had, how much each experiment was running, the totals, and we could see at a glance how all our experiments were doing. This was a health check. Every morning I would get up and say, OK, how's Abilities doing this morning? And I would look at one specific experiment. I'd drill down and say, OK, how's pullable_by doing this morning? Well, we're running quite a few of them. You can see the top graph is the total number of times the experiment has been run.
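The dark-ship pattern described above, always dual-write, only sometimes dark-read, can be sketched like this. Everything here is an assumption for illustration: the class names, the 10% sample rate, and the in-memory stores standing in for MySQL tables.

```ruby
require "set"

# Stand-in for the old collaborators table.
class LegacyPermissions
  def initialize; @rows = Set.new; end
  def add(user, repo);     @rows << [user, repo]; end
  def remove(user, repo);  @rows.delete([user, repo]); end
  def member?(user, repo); @rows.include?([user, repo]); end
end

# Stand-in for the new abilities table.
class NewAbilities
  def initialize; @rows = Set.new; end
  def grant(actor, action, subject);  @rows << [actor, action, subject]; end
  def revoke(actor, action, subject); @rows.delete([actor, action, subject]); end
  def can?(actor, action, subject);   @rows.include?([actor, action, subject]); end
end

class PermissionWriter
  READ_SAMPLE_RATE = 0.10 # dark-read the new system ~10% of the time

  def initialize(legacy, abilities)
    @legacy = legacy
    @abilities = abilities
  end

  def add_collaborator(user, repo)
    @legacy.add(user, repo)             # the old system stays the source of truth
    @abilities.grant(user, :read, repo) # always dual-write the new system
  end

  def remove_collaborator(user, repo)
    @legacy.remove(user, repo)
    @abilities.revoke(user, :read, repo)
  end

  def can_read?(user, repo)
    answer = @legacy.member?(user, repo)
    # Sampled dark read: exercise the new system for load, discard its answer.
    @abilities.can?(user, :read, repo) if rand < READ_SAMPLE_RATE
    answer
  end
end
```

Because writes always go to both systems, the new table stays current enough to compare against; because reads are sampled, you control how much load the experiment adds.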
And then it has a little line on the bottom for wrong, but because there's such a huge scale difference between wrong and total, you can't really see it. So we made a more zoomed-in graph below that, showing how many mismatches you have. We also cared a lot about performance, especially because of the dark ship; we couldn't have it being very slow. So we also had graphs for the performance of the new thing versus the old code. And we just kept looking at this. Once we saw mismatches, we wanted to actually analyze what we had. We'd say, OK, there are about 20 mismatches per hour. What are they? What's happening? What's going wrong? At this point we did it super simply: you can jump into the console and pull these things out of Redis and say, OK, how many times has pullable_by mismatched? We've got about 3,000 results waiting for analysis there. And you can just pop each result off. What it looked like was a super simple hash. It said, this is the experiment we're running, and then it had entries for the control, which was the legacy code, and the candidate, which was the new thing you were trying. It would show you: how long did each take? Did it raise an exception? And what value was returned? In the case of pullable_by, this was just a Boolean method, so it was true or false. With the context we had added, the repository and the user, I could then go into the Rails console and start investigating. I'd say, OK, what happened with this user? I could walk through the legacy code of pullable_by, line by line, and say, OK, well, it matched this condition, this condition, this condition, and here's where it went to mismatch. What went wrong there? What we found were a few things. There were definitely bugs in Abilities at first.
Like I said, we hadn't completely modeled the old system correctly to begin with, because we didn't even have the whole thing in our heads to begin with. At first, when that happened, we would fix a bug and then say, OK, we've run our migrators and we have all this data in the database; let's just truncate the whole table, re-run the migrators, and fill it up again, because at that point it was a bug in the system itself, not anything else. But once we got past that, and that didn't take too long, we hit something bigger: problems with our data. In fact, we ran into a lot of data quality problems. I can show you a sampling of the data quality issues I saw. There were a lot of them, and this is where we spent probably the bulk of our time. We found quite a few problems in the data we had in the database for the old system. Something had gone wrong at some point. Maybe somebody fixed a bug, but didn't know it had generated a whole bunch of bad data, and nobody cleaned it up. That data just kept going. It interacted with other data and got uglier and dirtier and more terrible. So we had to track down each and every case, find out why it got that way and how it got that way, and fix all of them. When we first saw these data quality problems, we thought, maybe it's one or two things; we can just ignore it, switch over to the new system, it'll be fine. No, it was definitely not fine. There was a huge number of data quality problems. So we said, we need to fix this in the old data in order for it to generate the new data correctly, and we need to be sure that both systems are matching and true and correct for the right reasons. And because we did this so often, we ended up writing a lot of tools to help with data quality. We wrote a library for running data transitions, because we were doing them so frequently.
We used to just run Rails migrations, but that was too brittle for us. We needed to be able to run them in parallel, to run a lot of them at the same time, and to throttle them. There were times when we were deleting millions of rows, and we didn't want to be attacking our database just because we were doing maintenance to clean things up. So we built throttlers to run these slowly over time and get rid of the data without anybody noticing. We also had problems with the legacy system itself. There were some things that just weren't thought out; there were features added at different times that didn't work together properly. In particular, I ran into a problem with networks of repositories and forks when they had different visibilities, some public, some private. The permissions were just totally messed up. So I had to stop working on Abilities and go fix that. I had to fix the code, fix the data, and contact a bunch of users to tell them I was going to do some crazy things to the permissions on their forks. It was long and it was involved, but we had to do it to move forward. Otherwise the system just wouldn't have worked; we wouldn't have been able to generate the right thing. So we got through all of that, and then we ran into some other stuff. Like I said, this was dark shipped, and we were watching the performance the whole time. In general the performance was good, but we started to see some interesting cases. The graph up there at the top shows something that started happening to us every day at 5 p.m. We had one particular customer doing something very interesting with our API. They were mucking with their permissions every day at 5 p.m., they had team and repository sizes much larger than anything we had dealt with before, and they were making a lot of changes.
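A throttled data transition of the kind described above can be sketched like this. The batch size, the pause, and the array standing in for a database table are all illustrative assumptions; the point is the shape: delete in small bursts with a pause between them, so cleaning up millions of rows doesn't hammer the database.

```ruby
# Sketch of a throttled data transition: purge bad rows in small batches,
# sleeping between batches so the database isn't overwhelmed.
class ThrottledTransition
  def initialize(batch_size: 1000, pause: 0.5)
    @batch_size = batch_size
    @pause = pause # seconds to wait between batches
  end

  # `rows` stands in for a table; the block decides which rows are bad.
  # Returns the number of rows deleted.
  def run(rows, &bad)
    deleted = 0
    loop do
      batch = rows.select(&bad).first(@batch_size)
      break if batch.empty?
      batch.each { |row| rows.delete(row) } # one small burst of deletes
      deleted += batch.size
      sleep @pause if rows.any?(&bad)       # throttle before the next burst
    end
    deleted
  end
end

rows = (1..25).to_a
transition = ThrottledTransition.new(batch_size: 10, pause: 0)
transition.run(rows) { |n| n.even? } # deletes the 12 even rows, 10 + 2 at a time
```

Against a real database you'd delete by primary-key ranges instead of selecting everything, but the batch-and-pause structure is the same.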
They were basically trying to delete all their permissions and put them all back, one right after another, which sounds terrible, but we ought to have been able to handle it, and we weren't. Abilities wasn't handling it. The old system seemed to do it just fine, but Abilities wasn't handling it well enough. So we went back and reworked Abilities. We had been blowing out a lot of rows into the database, and we found there was a way we could infer certain things; we didn't need to write those rows at all. What that ended up doing was letting us delete 72 million rows. And you can see in the graph Nathan showed: after he finished this, the next day there was no blip at 5 p.m. Having that dark ship was super helpful for us. We could work on the performance over time, and we could see things as soon as they happened. It wasn't a case of, we ship it, we walk away, we don't care anymore, and then months down the line something crazy happens. We could watch it slowly as we ramped up. We could even turn Abilities off completely, so if we really needed to, we could just turn it off and not have performance problems at all. That was really helpful while developing it. So at this point, where are we? We've done a rewrite and a refactor and a bunch of science and a huge amount of data quality repair and some performance work. And it's taken a long time; this is probably close to two years in, and this is where our progress is: none of these things are using Abilities yet. But we just kept going. We kept doing that loop over and over again: have the data in there, read it out, find the mismatch, find the data quality problem, fix it, over and over again. This was my life for months and months and months. But finally we were able to start flipping things over, and we did it piece by piece. We started with organizations. Check, we got those.
Then we did teams. That wasn't too bad. Check. But the big thing left was repositories, and this was the thing that had gotten us into this problem in the first place. It was the place where we had the most data quality issues, so it was going to be the hardest and take the longest. We were expecting this, but we kept working through it and eventually got to the point where the science was mostly green. There was one last data quality issue. The problem was that when users were deleted, they weren't being removed from teams, so that was never cleaned up. It was just some legacy data sitting there, and we weren't going to carry that behavior into the new system. So I said, OK, I'm going to write a data transition like I have before to clean this up. And I ran it, and I queried something afterwards, and I said: shit. That doesn't look right. I ran another query. Oh shit, oh shit, shit. I had just deleted every single repository from every single team. I mean, this is the reason we wanted to rewrite this. That three-way join table was so bad that it was really hard to write a correct query against it, and it bit me. One last time before I could get out, it bit me in a really hard way. So you can see, we put up a status. We said, some of you may not be able to access your repositories. But, but, we did have Abilities. And I said, you know what? We have backups for our database, and our database administrator is fantastic. He was right on it. He's like, OK, we'll get this, we'll get all the data back in. And I said, while you're doing that, I'm going to turn Abilities on. I'm going to switch over to the new system, because it's pretty much ready, and right now it's more correct than the old system. So I did that. We got to have a little bit of a fire drill with Abilities. And so they got all the backups out and got all the data back.
And I said, OK, I'm going to revert that, because I'm just not confident that we should leave it on; I want to be absolutely sure before we leave it on. So once I had cleaned up my mess and gotten all that ready, I went back and looked at the experiments. I said, well, we're all green now. That was the one last data quality thing I had to fix. But I was a little gun-shy, especially after what had just happened. So I said, all right, I want to be really sure. I want to be scientific about this. I want something that runs through every single user and repository and compares both systems, to be sure there's no bad data sitting there that we just haven't hit because nobody's touched it in production. So I wrote that transition: just iterate through each of them. I was expecting data quality problems to come up. I thought, this isn't going to be the end; I'm going to find something else, I'm sure. But I didn't. I ran through all of it, and there was not a single mismatch. And so I said, all right, guys, it's time. Let's switch it over. The way we did this with science, just to be sure, was to swap the use and try blocks. We kept everything the same but switched which code was the control and which was the candidate, so we could still measure performance and see that there were no mismatches happening after the switch. And then, once we were extremely confident, we removed the science. So at this point, we've done it. Everything, organizations, teams, repositories, it's all using Abilities. This happened mid-August, and besides that little blip where I removed your access to repositories via teams, you shouldn't have noticed. We open sourced this library, because it was tremendously useful. I cannot even imagine doing this sort of rewrite without having this tool. It helped us write the original library. It helped us find massive amounts of data quality problems.
It helped with problems in the logic itself. I don't know how else we would have done it. So if you're doing a big rewrite like this, I would recommend that you use something like Scientist, or use some sort of data and graphs, and be very, very sure. And I hope that you use it. Thank you.