My name is Rebecca Slider, and I'm an engineering manager at Kickstarter. Before joining Kickstarter, I was a consultant, which meant that I traveled to clients of all shapes and sizes to talk about software development practices. Because of that, I've grown to care a lot about tooling and practices that help teams deliver better code more efficiently. Over the last few years in particular, I've come to appreciate simple code that can be understood and extended by my teammates. In a nutshell, to me, development should be as simple as possible. I try to stay away from writing clever code, and a lot of my time is actually spent thinking about the best possible way to communicate an idea, domain, or tool across the development team. Which brings me to the point of my talk. This is a tale of two feature flags. Specifically, what happens when a team implements a bunch of features, puts them behind flags, and ships them? Today, we'll be thinking about what makes a flag successful and comparing it to other feature flagging strategies that haven't worked out so well. So let's talk feature flags. This is the opening of the book A Tale of Two Cities, and I think it really nicely frames what I'm gonna talk about today. Software development, in many senses, has never been so good. Over the last few years in particular, we've developed languages, patterns, and release strategies that abstract away a ton of the grunt work and let us focus on the good stuff. It really is the best of times. On the other hand, though, all of these conventions, patterns, languages, what have you, they can be misused. In the wrong hands, the best of times really just means that we can release terrible code faster than ever. And feature flags really exemplify both the best and worst of software development. They're designed to make our lives easier, but often they do anything but. But I'm getting ahead of myself. Let's focus on the good things for a moment. 
What exactly is a feature flag and why are they so powerful? Well, features start off with an idea, right? Everyone's ready to go, jazzed up to build something great and deliver value to real live users. But then they're hit with a couple of really hard truths. First of all, writing software is hard work. Developing a feature from beginning to end takes a ton of development time, not to mention orchestration across the entire team. Ideas take work, turns out. And once all this hard development work has been done, typically you're sitting there with this big, bad diff's worth of work, just sitting there, ready to go out. And it's really risky to release that. The code's been tested, sure, but there are a ton of moving pieces and some small bugs are bound to slip through. And in thinking critically about this sort of big bang release, you're probably familiar with the type where all of a sudden you ship a ton of code, flip the lights on, and everything looks totally different. You realize that you're not sure how this feature is gonna perform with those real live users. What would be ideal would be to test your idea on a small subset of users and see how they react to it. This is difficult to do though, with a gigantic monolithic release. And finally, going back to an earlier point, there are a ton of moving pieces in what you've built. Beyond the things that you yourself or your teammates have implemented, perhaps you're relying on a totally new data store or a third party API. The point is, releasing this all at once is risky, especially when you're not sure how your dependencies are gonna scale. And you can communicate this to the business all you want, right? They can have all the grand ideas they want, and you get to be the killjoy. It's really hard, it's a ton of work. I'm not sure how the system's gonna perform with users or with our current infrastructure, and they'll say, fine, that's nice. So let's go back to my really nice big feature idea. 
How can we minimize the risk of releasing this grand idea that I have? And we all know that the answer is to put it behind a feature flag. At a high level, flags are configured in their own files, typically YAML files, where you set the features to on or off. And these files usually start out pretty naively, just a flat file with Boolean values, right? Big feature off, maybe some other features are on, that's how they're configured. And then, throughout your code base, rather than just implementing the feature, you query that Boolean config value and decide whether the user should see the feature or just the other existing functionality. Depending on the state of the flag, the user will see one of two things. Either the existing UI plus the brand new feature that's behind the flag turned on, or just the plain old UI. It's as if the feature doesn't exist to them. And when you think about it, flags solve a lot of the concerns that I mentioned earlier. Well, if something is a lot of work, then you can release it iteratively. You can send your code out behind an off flag as you develop it. You don't have to worry about big bang deploys or specking out months of work up front. You can just release as you go. Once the feature is ready to be shown to users, you can extend your flag to only be accessible to a small group of those people. For example, maybe in the beginning, you only want to show the feature to internal employees for testing. That's possible with feature flags. And as you gain confidence in the system, you can slowly increase who sees the feature until finally, you've turned the flag on for everyone, and you get to go back through and remove the flag and the conditional logic. And the feature is just on for everyone, which is pretty exciting, right? And as you might know, it really just gets better from there. I mentioned earlier this idea of finely grained toggling between groups of users, right? 
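A minimal sketch of what that kind of flat flag file and conditional check might look like (hypothetical names, in Python for illustration; in practice the config would typically live in a YAML file):

```python
# Hypothetical flat flag config: feature name -> Boolean, off by default.
# In a real app this usually lives in a YAML file checked into the repo.
FLAGS = {
    "big_feature": False,   # still in development, hidden from everyone
    "other_feature": True,  # fully rolled out
}

def render_page():
    """Return the UI sections a user sees, depending on flag state."""
    sections = ["existing_ui"]
    # Query the Boolean config value rather than shipping the feature outright.
    if FLAGS["big_feature"]:
        sections.append("big_feature_ui")
    return sections

# The same check can later be refined to a small group, such as internal
# employees only during early testing (hypothetical user shape):
def big_feature_enabled(user):
    return FLAGS["big_feature"] and user.get("employee", False)
```

With the flag off, `render_page()` returns just the existing UI, as if the feature doesn't exist; flipping the config value turns the feature on without a redeploy of new code.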
So beyond even just doing internal only, you can target maybe only French speakers or only people within a certain mileage of San Antonio. A really cool thing about feature flags is that they can be really easily extended to A/B testing, right? With a flag, you're either showing one thing to some users or another thing to another group of users. And then once you've got that, all you need to do to make it an A/B test is just measure what happens afterwards. So it's a pretty cool natural extension of a feature flag. And finally, a lot of teams use feature flags as kill switches. So when a part of their system fails or a feature isn't performing as they expected, they're able to just turn off that part of the system without taking the rest of the application down. A good example of this is a page that might be experiencing really high load because people are pinging it and adding a ton of comments to a blog post or something. And rather than just taking the whole page down, right? Or the site breaking somehow, you can just turn off the comments feature and make it read only. So, these are all some pretty cool natural extensions of feature flags. Which just sounds like the best of times, but what could go wrong, right? So, let's talk about what happens in practice when a team decides to ship code behind a feature flag. We always start out with the best of intentions, right? So, I worked on a project last fall where we were shipping something called faceted search. It was for a retail client and the search results would come from different sources. Each source was considered a facet. These sources were independent services and the client didn't have a ton of confidence in the performance of these services. So, we decided to put our feature behind a feature flag. And that's how we started, a feature with enough ambiguity and dependencies to warrant putting it behind a flag for testing purposes. All was well. 
Or at least, all was well until the inevitable scope change came up. The external dependency wasn't ready and we needed to move on to the next facet. And unfortunately, that facet needed to be behind a totally separate feature flag so that we could deploy the features completely independently of each other. And that's how our tidy, well-defined feature flag became two feature flags. In theory, this doesn't sound so bad. Two facets, one flag each. Toggling one facet or feature on or off would be totally independent of the other, until we started writing the code. We had no idea what the user should see if one feature was toggled on, but not the other, and vice versa. These flags, which were seemingly independent, were actually really tightly coupled to each other, just by virtue of them being associated with the same broader feature, search. My pair and I spent so much time speaking to various stakeholders that we ended up drawing a truth table to make sure that everyone was on the same page in terms of business logic. And the truth table itself, it started out easy enough. Both features on meant that both features would be shown in the UI. Pretty simple. And only one feature on meant that only that one feature would be displayed. Fine. Then it started to get confusing. Where on the page would each feature be shown? And what if the placement of even just one of these features was dependent on the rest of the UI being fixed? This is ridiculous. I mean, sure. It gave us a spec of exactly what the page was supposed to look like under every single possible condition. And it was helpful when we completed the work and we were ready for a QA on our team to look at it. But it was ridiculous. The feature was so complex and testing it was so challenging that we were bound to introduce bugs. And so we did. There was a ton of back and forth with this QA on our piece of work, simply because no one was exactly sure what the user was supposed to see. 
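That combinatorial blow-up can be made concrete: with two coupled flags there are already four UI states to specify, and each additional flag doubles the size of the truth table. A small sketch:

```python
from itertools import product

def ui_states(num_flags):
    """Every on/off combination the team must define UI behavior for."""
    return list(product([False, True], repeat=num_flags))

# Two "independent" facet flags already mean four distinct UIs to spec out...
assert len(ui_states(2)) == 4
# ...and every additional flag doubles the truth table.
assert len(ui_states(3)) == 8
```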
Which brings me to my first point. That treating independent flags as a way to release a single feature iteratively is really challenging. And at the time, about a year ago, our team realized that we had a couple of options, right? We could either scrap the project, push back on PM, and just say, like, we needed to wait until the service was ready. Which at the time didn't seem very realistic. So we went with option two, which was to go ahead but be really careful about testing this feature. And that sounds fine, except for the fact that you can't really automate carefulness, right? Especially in manual testing. Something's always going to slip through. And all of this time spent clicking through manual testing things, it got tiring. And we realized that all of this time spent testing was actually time that we weren't getting feedback from our client or from real users. Which we thought was the whole point of putting something behind a flag and getting it out there. So this is kind of a messy situation. And what we realized is that a key thing about feature flags is that they ought to be simple. They shouldn't be adding a ton of complexity to your app. In fact, they should be just like very simple Boolean values wrapping the complexity. You shouldn't have to think a ton about implementing them or working with them. And if you're spending a lot of time implementing the flag, rather than the feature it wraps, feature flags probably aren't going to solve the problem that you have. In fact, you might even be using these flags where they shouldn't be used. So the moral of my faceted search result story is that we quickly learned that we should stick to one flag at a time for any given feature. Because even when you manually QA the hell out of something, bugs are bound to happen. So we might as well go with something that's simple enough to allow us to focus on implementing our feature as opposed to the flag. 
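One way to keep a flag that simple is to confine it to a single seam in the code: one module consults the flag, everything else stays flag-agnostic, and removal later means deleting one conditional. A hypothetical sketch, not the code from the project:

```python
# All knowledge of the flag lives in this one spot (hypothetical names).
FLAGS = {"faceted_search": False}

def legacy_search(query):
    return ["legacy:" + query]

def faceted_search(query):
    return ["faceted:" + query]

def get_search_backend():
    """The single place the flag is consulted. Removing the flag means
    deleting this conditional plus the legacy implementation."""
    return faceted_search if FLAGS["faceted_search"] else legacy_search

# Callers never mention the flag:
results = get_search_backend()("shoes")
```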
So now let's go a little further with one of those flags and talk about how it was actually implemented. The search feature that I mentioned earlier was delivering different kinds of results that we called facets, right? And we ended up placing this flag for this facet in the back end, which at the time seemed okay. We could deliver search results to the UI without having to toggle based on the flag's value there. The UI would be feature flag agnostic, which we thought was pretty cool. I have some pseudocode up of what our approach looked like. So that we wouldn't have to tell the front end that we were making back end changes, we ended up appending logic to the existing back end method that performed the query. And that was fine in the short term. It was temporary code anyway, right? Over time, however, these flags started spreading throughout the code base. As other teams implemented other features, we realized that each team had a totally different implementation strategy. And for us, with our back end strategy, unit testing was particularly difficult. We had to tell each test about the state of our flag. These flags were temporary, which meant that we spent a ton of time writing what were supposed to be small, isolated tests. In the end, these tests ended up knowing a lot more about the system than what a unit test traditionally should. It doesn't work. It's sad. So removing toggles is really hard with this strategy. We had to go back and update back end code once again after the toggle was dialed up to 100%. This meant that we spent a lot more time cleaning up the code because both the logic and those unit tests that I mentioned were coupled to the flag concept. And the worst part about this was that we were breaking tests as we went back through and removed the flag, right? Because all of our unit tests had to know about the flag state. 
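A rough reconstruction of that back end approach (hypothetical names, not the actual project code): the flag check is bolted onto the existing query method, which is exactly what forces every unit test to know about flag state.

```python
# Flag logic appended to the existing query method (hypothetical sketch).
FLAGS = {"facet_search": False}

def search(query):
    results = {"products": ["existing result for " + query]}
    # "Temporary" flag check living inside the existing method.
    if FLAGS["facet_search"]:
        results["facets"] = ["facet result for " + query]
    return results

# A "small, isolated" unit test now has to set up global flag state first,
# and it breaks again when the flag is eventually removed.
def test_search_includes_facets():
    FLAGS["facet_search"] = True
    try:
        assert "facets" in search("shoes")
    finally:
        FLAGS["facet_search"] = False
```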
So this wasn't actually just a very simple refactoring, because we weren't in a state where all of our tests were giving us any confidence. In fact, what we were doing in removing that flag was more like feature development, which just meant more opportunity for bugs and errors and more necessity for additional testing. Removing a flag had become really time intensive for our team. And this was the absolute best case, wherein a team takes the time afterwards to clean up the flag once it's ready. Often, engineers move on to a new feature before having the opportunity to remove the flag. It's saved for later. The team becomes scared to remove it, as it probably gates some critical piece of functionality. So this is how flags that were intended to be temporary become immortalized as kill switch flags. The team doesn't really know how to clean it up, or maybe just figures that such an important part of the code base that's placed behind a flag of this gravity deserves this kind of finely grained control. What's terrible about this is that the code wasn't necessarily written in such a way that it should stick around forever. It's just needless complexity, conditional statements, config values that exist in your code base forever. It's additional mental cycles that a developer needs to spend every time they pull up a file touching the flag. This isn't to say, though, that all kill switches are always bad. Sometimes they work really well. Stack Overflow, for example, uses a kill switch to disable posting questions and answers when they're undergoing maintenance. These kill switches can come in really handy, which we'll go into a little more later. But what's important to note is that this kill switch was architected to stick around permanently. It's a whole different beast from a more temporary feature flag. So in thinking about flagging strategies for these temporary feature flags, I have a couple recommendations. 
My first is that your team should come up with an approach for isolating the flag. It shouldn't be used across the code base. In fact, ideally, it should only be referenced in a single place. Doing so will lessen confusion and room for error when you or your teammates go to remove the flag. If we take this idea of feature flag isolation a little further, I'd recommend that the team adopt the habit of treating flagged code as new code that is completely separate from the existing product. Again, this encourages ease of removal. Beyond that, it clarifies exactly what a feature flag is supposed to do and precisely what the state of the product will be once the flag is ripped out. And in the event that your team decides not to release the feature to everyone, it will be obvious exactly what will need to be removed. There'll be less wrestling with the code in that case, and less wrestling means decreased chance of cruft or bugs creeping in. So at this point, we've talked about the genesis of a feature flag and how it can warp before it makes it into production. And the kinds of code we write when shipping flags. Usually this is the end of the line for a flag. We ship a feature, turn the flag on, and we're done. Oftentimes we forget about the flag, or the person who wrote it changes projects or maybe leaves the company entirely. Sometimes we tell ourselves that we'll just leave it in there for some period of time just to ensure that the feature it's wrapping performs correctly. We'll come back to it later, it's not a big deal. Most code bases end up with a config file thousands of lines long denoting flags that have long outlived their utility. This is gross. These flags left unattended represent mental energy that the team has to spend every time they touch a file dealing with the flag. They're maintenance overhead. So sure, these flags are technical debt. But beyond technical debt, these flags are risky. 
They're configuration, and potentially code, that was engineered to be a temporary thing. They probably haven't been tested in the way that other code has, and they probably aren't understood by the current team members. A while back, a financial services organization called Knight Capital developed a tool called Power Peg that was built to mimic changes in stock prices in order to test their trading algorithms, so that this code didn't actually run on the real stock market. They placed it behind a feature flag and eventually turned it off in all environments. It was turned off for eight years, which kind of makes sense, right? You write something risky, the maintainers leave, it gets turned off. No one wants to remove it years later. So instead they just end up building stuff around it. They don't want to touch it, it's not theirs, they don't know how it works, makes sense. Years later, another team in a whole different part of Knight wrote some code, threw a feature flag on it, and shipped the code. What they didn't realize was that their flag had the same name as the flag that kept Power Peg turned off. When the stock market opened the next morning, the defective Power Peg code caused Knight to send millions of orders into the stock market. The system didn't have a kill switch, and they did manual deployments. So over the course of 45 minutes, Knight Capital sunk $460 million into incorrectly priced stocks. Eventually, the New York Stock Exchange stepped in and triggered a circuit breaker to halt all trading on these stocks. This debacle caused Knight Capital's stock price to collapse, and they never really bounced back from this. They were acquired a couple months later, which is a pretty hellish story. Obviously, this is something that we'd all like to prevent from happening in our own production environments. And in thinking about how exactly this had happened, what it comes down to is that there was a long-lived flag that didn't need to be in production anymore. 
No one was maintaining it, it was just legacy code that no one wanted to touch. And long-lived feature flags are a relatively common problem, right? I mentioned earlier that tons of organizations have code bases with config files thousands of lines long. And conventional wisdom with this issue is that you can just make the problem more visible. The thinking is that visibility equates to empathy, which eventually means more buy-in across the organization for time or people to solve the problem, in this case, go in and remove the flags. So typically, folks will try to slap this kind of thing on a dashboard. And sometimes it'll even be put on a screen around the office for the whole team to look at. And this is fine, it's a good first step, but to me, it lacks agency. Everyone can see that you have an old flag, but if it's been sitting around in the code for a while, chances are that that's not gonna come as a surprise to anyone on your team, right? They're pulling up the code base every day and looking at those flags. It's more likely that most folks just probably don't wanna remove it for whatever reason. So what you really need is a forcing function, or a way to delegate the problem to a specific person. Let's go back to the flag configuration file I mentioned earlier, which was just key value pairs, right, this flag on, that other flag off, just that whole flat file of Boolean values. The solution that we came to at Kickstarter was to implement additional fields for each feature flag. In addition to on or off, we added a deadline field, which specified exactly when that flag should be removed. And we also added a maintainer field, which delegated exactly who was responsible for the flag's eventual removal. On our CI server, every time tests are run, we then check whether each flag is past its deadline. 
If it is, we create a new branch, amend the config file to contain a message to the maintainer, and commit the branch, CCing the maintainer in the commit message. We use the GitHub flow really heavily at Kickstarter, which means that the maintainer soon gets an email that she was CC'd on a commit. And, you know, mistakes happen. Maybe the maintainer's on vacation, or maybe she's left the project or the organization. Maybe this branching thing doesn't always work. So in that case, if the branch exists and has been around for a while, which in our case is more than two days, our CI server commits again. And instead of CCing the maintainer, we escalate it and CC the entire team. This works for us, but this strategy is admittedly pretty Kickstarter-specific, right? I'm talking about our CI server, our GitHub workflow, all that. But I think that the approach of identifying a maintainer and a deadline can take a lot of different shapes, depending on your individual code base and your team culture. So, beyond a dashboard or visualization, my recommendation for teams trying to avoid long-lived feature flags is to define a lifespan for each flag. The flag and its cleanup should be delegated to an individual or team in order to ensure its removal. Most importantly, teams should be ruthless about deleting these flags that have just outlived their utility. I think we can all agree that feature flags introduce complexity and overhead into a code base with the end goal of getting something out faster. There are obvious trade-offs here. Feature flags, at the end of the day, they create technical debt. But they also help us achieve some really great things, primarily releasing better code faster and more iteratively. But without careful consideration, the damage that feature flags can wreak on a code base or even on a production environment can quickly outweigh the benefits. So, a team has to really think about what it means to do them right. 
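In code, the extended config entries and the CI-side check might look roughly like this (hypothetical field names, owner, and dates; the real version runs on CI against a YAML config):

```python
from datetime import date

# Each flag carries an owner and an expiry date alongside its on/off state.
FLAGS = {
    "faceted_search": {
        "enabled": True,
        "maintainer": "rebecca",       # who is responsible for removal
        "deadline": date(2016, 3, 1),  # when the flag should be gone
    },
}

def expired_flags(today):
    """Return (name, maintainer) pairs for flags past their deadline,
    so CI can open a branch and CC the responsible person."""
    return [
        (name, cfg["maintainer"])
        for name, cfg in FLAGS.items()
        if today > cfg["deadline"]
    ]
```

The forcing function is the deadline check running on every CI build; the delegation is the maintainer field, with escalation to the whole team if the branch lingers.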
Get the worst stuff out of the way and focus on the best of times, which, to me, is leveraging these flags to ship great code.