Ruby refactoring with confidence and science. I'm Jesse Toth, @jesseplusplus on GitHub, Twitter, and everywhere else on the internet. I'm the head of web technology for GitHub, which means I work on the web team behind github.com, mostly on our Ruby on Rails application, with a particular focus on database and performance work. This talk is about one of the projects I did in that area: tackling a large Ruby refactoring.

What do I mean by "large" here? It could be the size of the system you're trying to change, or, and this is especially the case I'm talking about, how risky it is to make the change. But then, what about "refactoring"? Traditionally refactoring shouldn't be risky. It's supposed to be changing the structure of the code without changing its behavior. Usually you can do it in small pieces, and if you have good test coverage you should catch any accidental behavior changes. But in this case I'm talking about genuinely risky changes that go a little further than typical refactoring: not just changing the structure of the code, but changing some of its underpinnings, while still keeping the behavior the same. Maybe you're switching out a service or a gem and you want it to behave the same way even though the code underneath is different, or maybe you're optimizing a query, moving it from your model to raw SQL because it needs to be really fast. Any code interacting with this code at its boundaries shouldn't notice a difference, but things are going to change underneath, and hopefully get better for whatever metric you're trying to improve, which is usually performance.

So maybe a better word for this is "reworking" or "replacing." In some cases that might work, but I didn't want to use those words because I don't think they emphasize enough that the behavior really shouldn't change. "Rewriting" is really the best word, but if you've ever done a rewrite, you know that word has a lot of baggage. Most people have some PTSD around rewrites: they're very risky, anything larger than a little rewrite tends to drag on and on, and more often than not they fail because you couldn't duplicate the behavior of the original system. Then you're left with something that doesn't match the old system, or you throw it away and try again.

Anyway, I'm going to talk about some of the tools and techniques we use at GitHub to tackle these sorts of problems, through a case study: a rewrite we did of the permission system behind github.com. What we had controlling all of the permissions on github.com was a huge legacy system, and what we wanted from this rewrite was a more flexible system to grant and revoke access to anything you could have access to on GitHub: repositories, forks, issues, pull requests, teams, organizations, basically everything. As you can imagine from that description, this would be a pretty far-reaching change. It would affect every single page load on github.com and every single API request.
It's a huge surface area and it's extremely risky, so if you're going to make a change like this, you have to have absolute confidence in the correctness of the new system when it's time to switch it out.

The first question to ask is why we would want to do such a thing, especially given the risk. Answering that requires a little history lesson on GitHub itself. In the beginning, there was collaboration: one of the very first features of github.com was that you could add other users to your repository as collaborators, giving them access to your code so they could make pull requests and changes. That worked very well for quite a while, but eventually we needed more than that. So then there were organizations: as GitHub grew, we needed a way to give repository access to groups of people, and to control that access for different groups and teams in different ways.

The problem was that these two ways of granting access were written separately. They were developed at different times by different people, and, as most legacy code goes, it wasn't obvious at the time that these were really the same concept. So they were built and represented in very different ways behind the scenes.

Here's what permissions looked like to begin with. Collaboration on repositories was a super simple join table: here's a user, here's a repository, they have access. That's about it. Team membership, which is how you give access to repositories in organizations, was stranger, and every time I look at it now I shudder a little. It was a three-way join table: you have the team, but then you have both users and repositories in the same table. To be a member of the team, a user has a row in the table. But to give the team's users access to a repository, the repository also gets a row in that same table, as if the repository were a member of the team. Pretty terrible, but it's what we had, and hey, it worked for a really long time.

With that in mind, let's talk about how we used these tables to check permissions. There were tons of places in the code base that needed lists of things people had access to: repositories, pull requests, teams, and so on. They all accessed permissions in slightly different ways depending on which kind of data they wanted, and in true legacy code fashion, even places like the API and the web that wanted the very same list did it in slightly different ways, because they were written by different people at different times. For example, to look at all the repositories an organization controls, you'd probably only look at the team members side, never touching the collaborators in the permissions table. On the other hand, to get every pull request a user has access to, you'd have to look at both the permissions and team members tables, work out which repositories the user has direct access to and which ones they can reach via organizations and teams, and put it all together. What we saw with this was a lot of bugs and edge cases, especially around transitional states.
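To make those two legacy tables concrete, here is a rough sketch of what they might have looked like; the table shapes come from the talk, but the column names (including the polymorphic member_id/member_type pair) are my guesses for illustration, not GitHub's actual schema.

```ruby
# A sketch of the two legacy tables described above; column names are
# illustrative guesses, not GitHub's real schema.
class LegacyPermissionTables < ActiveRecord::Migration
  def change
    # Direct collaboration: "this user has access to this repository."
    create_table :permissions do |t|
      t.integer :user_id
      t.integer :repository_id
    end

    # The strange three-way join: a team's rows hold both its user
    # members and the repositories the team grants access to.
    create_table :team_members do |t|
      t.integer :team_id
      t.integer :member_id   # points at EITHER a user or a repository
      t.string  :member_type # e.g. "User" or "Repository"
    end
  end
end
```

Those two shapes are what every permission check described below had to reconcile.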
And because each call site was different, the bugs were slightly different each time. The scariest ones, like I said, were around transitional states, and unfortunately GitHub has a lot of transitional states. You can transfer a repository from one user to another, or from a user to an org, or from an org to a user, and that changes how the permissions should be represented. Or you can do something fun like transform your user account into an organization, which should move the permission grants from the permissions table to the team members table. Or when you remove somebody from a team or from your repository, you want that to take away their access to the repo, and that's a transitional state too. We started seeing lots of different bugs around these, all over the place. That was pretty scary for us and for users, because when you grant or remove access, you want to be absolutely sure it's doing the right thing, not that you thought you removed somebody from a repository but somehow they still have access.

Beyond that, we started seeing performance degradation. Over time, as GitHub grew, different places started seeing different performance problems depending on the API being called, especially the endpoints that had to grab large lists of things, like all of the repositories someone has access to. Each time a performance problem came up, somebody would go in and optimize that one spot a little, and eventually every place that grabbed lists of repositories, or lists of anything, looked like a huge chunk of raw SQL. Of course, they were all accessing slightly different things and were all optimized at different times by different people, so the queries were slightly different in each case. That makes it really difficult to build anything on top of this: if everything is one huge query, building something new means adding another line to that query and potentially breaking its performance.

So when our CEO, Chris Wanstrath, suggested, hey, let's find a way to make orgs better, that sounded pretty hard, especially because his suggestion was: let's change permissions in this way, and this way, and this way. With all the problems we had been seeing with the existing permissions, it seemed impossible to make changes like that without first fixing the existing system. So we put on our hard hats and said: all right, let's rewrite this. We wanted to build the new system in parallel with the legacy system, touching the brittle old code as little as possible, and to be able to run experiments and performance tests side by side as a way of validating the new system without disturbing the old one. We came up with some pretty simple goals: a simple, flexible interface to grant and revoke a general permission, not one specific to repository or team permissions. It had to be super fast, since we were already seeing performance problems with the old way, and anything new we built needed to scale much further into the future.
And we wanted it to be easy to integrate and operate with the technologies we already had. We looked at things like graph databases, but it didn't make sense for us to take on a brand-new technology, so we wanted to build it right on top of MySQL, the way the old system was built.

The first thing we did was spike something out. It was important to test how the new system might perform under production load as quickly as possible, so that we didn't go too far down the wrong road building the new thing only to find out it wasn't going to scale at all. The first spike was called capabilities, and it was something John Barnette worked on a bit. It was a rewrite, but there was also some refactoring we had to do just to be able to test the spike. As I said earlier, there were multiple places where even if the API and the website were fetching the same list of repositories, they were doing it in different ways. To test the new system at all, we needed to refactor those call sites to go through a single point, so that we could hook in there and compare the new system against the old one.

This is where we ran into a problem. I think we first tried to tackle the API endpoint that lists all the repositories an organization has access to. But even the refactoring alone was risky, because the existing code was a huge chunk of SQL. We didn't have test coverage for all the crazy, complicated cases we had in production, and even if we could duplicate all of those cases in tests, it would slow our test suite down a lot, and it would still be hard to have confidence that we had tested every single case. Someone also pointed out that there might be bugs in the current implementation that people were relying on, so whatever we did, we had to duplicate the bugs too, and really know exactly what changes we were making.

At this point a bunch of engineers at GitHub put their heads together and came up with a little hack to do this one refactoring: a way to test the behavior of the legacy code path and the refactored code path together. Basically it amounted to dark-shipping the refactored path: running it behind the scenes in production, using our instrumentation library to subscribe to an event, and comparing the results of the two paths. If they didn't match, we'd throw the results into Redis so we could look at them later and see what the difference was and why. This actually worked really well. It let us gain confidence in those initial refactorings, and it even uncovered some bugs in the legacy implementation.

It worked so well that we wanted to keep using it, so we pulled it out into a science library that anybody at GitHub could use to test a new code path and be really confident while rolling it out. To give you an idea of what this looks like, let's look at the pullable_by method on the Repository class. To do science, you take your old method, wrap its body in a little science block, and give the experiment a name. We tend to namespace the experiment names so we can group them together, because we ended up doing a lot of experiments.
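Here's a rough reconstruction of what that looked like, walked through below; the science helper, the method names, the event payload keys, and the $stats/$redis globals are my own illustrative stand-ins, not GitHub's actual internal code.

```ruby
# A sketch of a science experiment in the style described in this talk.
class Repository < ActiveRecord::Base
  def pullable_by?(user)
    science "permissions.pullable_by" do |e|
      e.context user_id: user.id, repo_id: id  # extra data for mismatch debugging
      e.use { pullable_by_legacy?(user) }      # control: the old code; its value is returned
      e.try { pullable_by_abilities?(user) }   # candidate: the new code path
    end
  end
end

# Subscribing to the instrumented events and publishing to statsd/Graphite:
ActiveSupport::Notifications.subscribe("science") do |name, start, finish, id, payload|
  experiment = payload[:experiment]
  $stats.increment "science.#{experiment}.total"
  $stats.timing "science.#{experiment}.control",   payload[:control][:duration]
  $stats.timing "science.#{experiment}.candidate", payload[:candidate][:duration]
end

# Mismatches get counted and pushed into Redis for later inspection:
ActiveSupport::Notifications.subscribe("science.mismatch") do |name, start, finish, id, payload|
  $stats.increment "science.#{payload[:experiment]}.mismatch"
  $redis.lpush "science.#{payload[:experiment]}.mismatch", payload.to_json
end
```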
So the first thing you do is take the original code from that method, the old pullable_by code, and pull it out into a new method that we'd name with a _legacy suffix, and call it from the use block. Whatever is in the use block is what the method continues to return: that's the old code. Then you say: but also try this other thing, the new code path you want to move to, which we'd pull into its own method as well. Since the new system was called abilities, we named those methods with an abilities suffix. So at the same time, try this new method, compare the results, and tell us the result of the experiment. We also added a context to make this more useful: if things didn't match, we wanted to know which user and which repository we were testing against each other.

When things don't match, you want to publish the results. Our publish method just uses ActiveSupport::Notifications to instrument these events, which you can then subscribe to. Subscribing to a science event, you can push the interesting parts to something like statsd and Graphite. We keep track of how many total times an experiment has run, plus timings on each part of the experiment, the use block and the try block, so we can compare performance as well as the actual result. You can also subscribe to the mismatch event, which fires when the results don't match, so you can track how many times things go wrong, and then we simply push each mismatch into Redis so we can look at it later.

One caveat about this technique: you can only use it on code paths that have no side effects. If the code writes anything, you can't use it, especially because if both paths were to change the same thing, it's not going to work.

Using that, we were able to do a bit of a spike and figure out what we wanted the real permissions system to be, and the science got us the refactored single points where we could attach the new system once we built it. Taking those lessons, we put together a team and started building a new system that we called abilities, and we called ourselves the ability buddies. Here's what we came up with, and I think it's super simple; I really like it. You have actors and subjects, and you ask an actor whether they can perform an action on a subject: can this user read this repository? A subject grants a permission to an actor: if I'm a repository, I can grant this user a write permission. And to remove a permission, you just revoke whatever permission was granted to the actor. The only additional constraint was that we wanted cascading behavior for things that are both subjects and actors. If an ability is granted to a team on a repository, and then a user is granted an ability on the team, we want an implicit ability to cascade between the user and the repository, so the user has access to whatever the team has been given access to. But that was about it. That was the interface, and once we figured it out, it was time to start implementing.
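Concretely, the interface looked something like this; the method names are approximations of the talk's description, not necessarily the real internal API.

```ruby
# Ask: can this actor perform this action on this subject?
user.can?(:read, repository)

# Grant: a subject grants a permission to an actor.
repository.grant(user, :write)

# Revoke: remove whatever permission the actor was granted.
repository.revoke(user)

# Cascading: teams are both subjects and actors. Granting the team an
# ability on a repository, and the user an ability on the team, implies
# an ability between the user and the repository.
repository.grant(team, :read)
team.grant(user, :read)
user.can?(:read, repository) # => true, via the team
```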
We wrote the core of abilities in just a few months; it was pretty quick. Once it was written, though, we needed to generate all of the data to actually have the permissions in the new system. So we wrote migrators that iterated through all of the permissions in the legacy system and generated new abilities entries for each one. But generated data is only a snapshot in time, so to be able to compare the two systems, we also had to update the abilities data any time the legacy system changed: any new permissions had to be written to abilities at the same time, and any changes had to be mirrored as well.

Once we did that, we could start doing science. We added science to all of the read points that touch permissions, and there are a lot of those throughout the code base. Because there were so many, we had to split up the work and parallelize it: half the team did the refactorings to find all the points where we needed to add science, and the other half went through the experiments that were already running, analyzed the results, and worked out whether things were matching, and if not, why, and how to fix it.

So let's dive into what the data science collects looks like, and how we used it to determine correctness. We built a dashboard to summarize and visualize the data we had instrumented into Graphite. The graphs on the left give you a day in the life of each experiment: what do the past 24 hours look like? Is it mostly green, meaning no mismatches? Is there a little yellow, meaning a few mismatches? The really red ones are the experiments where everything's going wrong. The right side gives some statistics on how often each experiment runs: how many times per minute, and what percentage of the time we run the experiment. For code that isn't very performant, you don't want to run the experiment 100% of the time; maybe you run it on 1% or 10% of calls, just to start collecting data until you get fewer mismatches.

Clicking into any of those summaries shows a more detailed view of the experiment, with the Graphite graphs behind it: the totals we saw earlier and how many mismatches there were, and at the bottom, the performance data, so you can see how the new code you're testing performs versus the old code. In this case you can see we were running a lot of experiments, around 20,000 per minute, and there weren't very many mismatches, but they weren't zero. If you see mismatches in that middle graph, you want to go look at them and see what the difference is and what's going wrong. To do that, as I showed earlier, we just throw the mismatches into Redis. So you can look in Redis under the key the experiment is named for and see how many mismatches there are. For this pullable_by experiment, we had 3,000 mismatches, and you can grab the first one by popping it off the list: okay, show me the top mismatch.
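In a console, that inspection might look something like this; the key naming follows the subscriber sketch above, and the payload fields and values are illustrative.

```ruby
require "json"

key = "science.permissions.pullable_by.mismatch"
$redis.llen(key)
# => 3000

mismatch = JSON.parse($redis.lpop(key))
mismatch["context"]   # => {"user_id" => 1234, "repo_id" => 5678}
mismatch["control"]   # => {"duration" => 0.8, "exception" => nil, "value" => true}
mismatch["candidate"] # => {"duration" => 0.3, "exception" => nil, "value" => false}
```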
There's a bunch of data in there, including the context we pushed on, but the most important part is what the candidate did and what the control did. It gives you the timing information, tells you whether an exception was raised, and gives you the return value. For this pullable_by method the return value is a simple true or false. Here the candidate, the experimental refactored code, returned false, but the control returned true. That's the mismatch. So we can take the user ID and repository ID we pushed onto the context and investigate why this particular case went wrong.

Early on, the mismatches were mostly just bugs in abilities. We didn't get it totally right the first time; we hadn't accounted for every case, or hadn't realized the old system did some particular thing, so we just had to fix it. In those cases, we would generally fix the bug, completely blow away the abilities database, and re-run the migrators to regenerate all of the data. We thought that once we fixed those bugs everything would be good and we'd be ready to ship abilities, but then we started running into other issues: the data in the database itself. Investigating some of the mismatches, we found issues in the legacy permissions tables that were actually being masked by conditions in the code. There was bad data in the database; we were pulling it out, but something in the code was filtering it and saying, forget that, the database says this is pullable by this user, but under this condition we'll just say it's not. We decided that if we were generating the abilities data from the legacy data, the legacy data had to be correct, so this is where we had to go back and actually fix the old data. And this happened a lot. What I showed is just a sampling of the data quality issues; we had quite a few, and we ended up adding a lot of process around these migrations, or what we call data transitions, to fix data like this. We ran them quite a bit, and we even ended up writing throttling code for them so that deleting millions of rows wouldn't hammer our database while they ran (there's a sketch of that pattern below).

Once we fixed the data quality issues, we thought we were done, except for one more thing we hadn't expected coming into this. There were still some mismatches, and it turned out the legacy system handled some things inconsistently. There was literally no way to duplicate the old behavior in the new system, because the old behavior was order-dependent and time-dependent: based on when things happened and when the code had changed, different things ended up in the database. A lot of these centered around forks, especially forks with different visibility, like public repos with private forks. So we actually had to stop working on abilities, take a break, fix the problem in the code, and come up with a consistent system.
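Before moving on, here is the kind of throttled, batched data transition mentioned above: delete bad rows in small batches and pause between batches so a multi-million-row cleanup doesn't overwhelm the database or its replicas. The scope and names (like orphaned_user_ids) are hypothetical stand-ins.

```ruby
# e.g. team_members rows pointing at users that no longer exist
scope = TeamMember.where(member_type: "User", member_id: orphaned_user_ids)

loop do
  ids = scope.limit(1000).pluck(:id)   # work in small batches
  break if ids.empty?
  TeamMember.where(id: ids).delete_all
  sleep 0.5                            # let replication catch up between batches
end
```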
Fixing the fork behavior actually took a couple of months, because we had to fix the code and fix the data. And because these were privacy issues, really sensitive stuff, we had to spend time contacting users and emailing them to warn them about the changes we were going to make. That was something we did not expect. I think if we hadn't had something like science to test the behavior, we wouldn't have known it was happening; we would have rolled out the new system and people would have started sending in requests asking what was happening, what was going on, why things had changed. So we were really lucky to find that, and definitely thanks to these tools.

The next thing we started to find was performance problems. I showed you earlier the graph at the bottom comparing the performance of the old code path and the new code path. For most things we had optimized at the beginning, and there wasn't a big performance difference. Where we actually started to see problems was in pathological cases, which only surfaced because we had this dark ship running for a very long period of time. People were doing interesting stuff with our API, like removing huge numbers of permissions and re-adding them, and there was one customer who was basically doing this every night. Every day at 5 p.m. this customer would drop all of their permissions and re-add them, over and over again, and we would see a spike in replication lag in our database, which is not good for us. But because we had the dark ship, we were able to see this, and it gave us an opportunity to optimize abilities even more. And of course, to fix it we wanted to use our favorite tool, science. In this case we actually did science within science: abilities itself was already running inside an experiment, but we wanted to test two different code paths within abilities, a way to infer some permissions rather than exploding everything out into the database quite as much. We tested it, it worked, we shipped it, and we stopped having those problems. So that was a really great way to find that out.

So let's take a step back: at this point in the story, what have we really done so far? We've done a rewrite and a refactor. We've done some science. We've done data quality repair and fixes for performance. All right, what's our progress? None of these things are using abilities. We haven't shipped anything yet. But that's okay; once we got to this point, we were able to start shipping pretty quickly. As soon as we fixed the last of the data quality issues, we rolled abilities out, for organizations first and then for teams. So check, check, we've got two things, a little bit of progress. It took a lot of work up front to get there, but from this point on we could get things out and using abilities much more quickly. The only thing left was repositories, and repositories were the largest and riskiest piece: the most used part, the most complicated. We kept slowly chipping away at the bugs and the data quality issues surrounding them, and we were almost at the finish line. In fact, we got to the point where, within a week, we had seen only one kind of mismatch, around one particular data quality issue.
So we said, all right, this is the last thing we have to fix, and then we can ship it. I coded up the pull request to fix it. The problem was that when users were deleted, they weren't being removed from their teams; the old rows just sat around. So I said, let's change the code so users are always removed from teams upon deletion, and then we'll clean up the old data with one of our transitions. I wrote that transition, meaning to clean up all the deleted users from teams. Instead, I wrote a bad query and accidentally removed every single repository from every team on GitHub. Oops. Yeah, you can see the outage there. We were right at the end, and unfortunately science couldn't save me from my own mistakes, but we were just about ready to ship; things were pretty much green. And we had database backups, so not everything was lost, but we couldn't instantly restore them. So what we decided to do was turn abilities on right then to restore access to people. It meant using the new system a little sooner than I had expected, but it was an emergency test run, and we were able to restore access probably an hour or two faster than we could have by just waiting on the backups. Once the backups were restored, I reverted that and turned abilities back off. Given my previous mistakes, I wasn't quite confident enough to really push it into production yet.

But this is where we were: everything was green, and it was about time to really ship this. I still wasn't totally confident, though. I wanted to make sure one more time that everything was okay, that nothing was going to go wrong when we rolled it out. So I came up with the idea of running through every user on GitHub with a transition that, for each user, simply asked for the list of all the repositories they have access to. That's a way to exercise every single permission in the database, and I figured it would tease out any last mismatches we were going to find. I ran it, and there were essentially no mismatches at all. Well, you can see three mismatches there, but those were a particular timing issue with one of our jobs when you remove a member from a team, a known thing we had seen throughout. So it was ready to go; it was time to give abilities to the world.

What we actually did, instead of just removing all the science code, was to flip the use block and the try block first, so that we could keep the science around in case any mismatches came up, and still watch the performance. You can see in the graphs where the candidate and control basically swap positions. We left that on for a few days, and once we were totally confident that there were no more mismatches and nothing was going wrong, we removed the science completely. So at this point everything's shipped, with abilities backing it all. That was almost a year ago, so this has been in production for a year, and we've had no problems with it. We've even had people starting to build new things on top of abilities, because we made something general and flexible, so that was also a success. And we have the code: it's open sourced on GitHub, and there's a Ruby gem you can use as well, called Scientist.
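For reference, basic usage of the open source Scientist gem looks roughly like this, adapted from the project's README; check the repository for the current API.

```ruby
require "scientist"

class MyWidget
  include Scientist

  def allows?(user)
    science "widget-permissions" do |experiment|
      experiment.use { model.check_user?(user).valid? } # old way
      experiment.try { user.can?(:read, model) }        # new way
    end # returns the control's value
  end
end
```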
So I encourage you to go out and use this for your refactorings, rewrites, or anything where you really want to gain confidence. Thank you.

Hi, I'm back. Do we have any questions for Jesse? While you're thinking of a question, a little known fact about her: she has two cats, and they're both named after Disney characters. She's apparently a huge fan of The Lion King, because one of them is named Nala, and the other is Eva, after the robot in WALL-E. Okay, we have a question up on the left.

You mentioned that one of the caveats of Scientist is that you can't really test things that write data, only reads. How does science handle model changes, or does it? If you're using, say, the same tables and the same model names, but you're slightly adjusting the columns or fields in those models, can science handle that? And secondly, the problem that comes to mind for me: we're attempting a migration from Rails 3 to Rails 4 right now, and this would be incredibly useful for saying, as I move these things over, let me test as I go along to see the cases I haven't yet covered or still need to work on. Is science applicable in those situations?

For the first question, with the model changes: if you were just accessing different columns, you could probably use it to test between them. Otherwise, the way we did it was with two completely separate models, because if either one is modifying things, you don't want the overlap; you really can't test them separately otherwise. I might need a bit more detail to understand the case, but you may not be able to use it for that. For the second part of the question, Rails upgrades, I've never considered using it for that. What parts of the upgrade would you want to test between?

Well, it sort of sounds like, so we suck at testing. This seems like a decent way to say: I'm now in the process of migrating, let me turn on science. I know everything's going to go wrong with my Rails 4 implementation at the moment, because it doesn't exist, but let me start seeing what the results should be, and as I build up my Rails 4 implementation, ensure it matches what my Rails 3 implementation does. Something at a much larger scale and not as granular, which kind of seems like the opposite of what science was intended to do, but maybe it fits as well.

Yeah, I guess you'd have to hook it in somewhere way far down. Right now it goes inside of your Rails application; the way the code works, you'd basically have to load up two different Rails applications and have science outside of both of them, running both, which it definitely wasn't designed to do at the moment. But you could test out something like that. Thank you.

Are there any other questions? Yes, great.

By the time you shipped abilities, I assume you had written a lot of experiments. When the old system goes away, I imagine there could still be some use for all the experiments that were written, as a sort of acceptance test. Do you see something like that, or did the already-written experiments go away after shipping?

So the experiments do go away after shipping.
In cases where there aren't enough tests around something, we can use science to compare the results and then add more tests to our test suite, but it's not something you want to keep running in production forever, because you're basically doing double the work every time the code is called. So we run an experiment for as long as we need to gather the data and have confidence, and then it tends to go away as soon as we're finished. Okay, thank you so much.