When to use Feature Flags: Extra-Ordinary Unconfidence

Feature flags are a sine qua non for effective Continuous Deployment (CD), and yet not every change needs one. Nearly every sprint I've observed for the past 10 years has included the question "should this be feature-flagged?" at least once. Such a common question deserves a framework for answering it.

Start with what's normal

What is the normal process for reviewing, merging, and deploying? Let's start with something reasonably common and close to ideal:

  1. Tickets are written with clear goals and acceptance criteria (AC).
  2. Developer proposes a change and code is reviewed. Assuming the change meets the AC—or is an atomic step in that direction—and the change is approved...
  3. The author merges the change into a main integration branch.
  4. Any continuous integration (CI) pipeline steps run. Assuming they succeed...
  5. The change is automatically deployed to a pre-production environment. Assuming the deployment succeeds...
  6. The change is automatically deployed to production.

(Pair programming, the nature of code review, and the very existence of pre-production environments are all their own posts.)

I say this process is close to ideal because it has only a single step where the author may be waiting for someone else. "In production" is a necessary (though not always sufficient) criterion for something to be "done," and any time the author spends waiting increases the time it takes for that change to reach production.

It's all about confidence

Confidence comes from a variety of sources: automated (and sometimes manual) tests; monitoring; robust CI/CD tooling; processes or procedures designed to increase it. One of the biggest is repetition, which builds trust within the team (e.g. the designers learn that the developers really are sticking to the designs, the developers learn that the product manager is consistent and clear with tickets, etc.) and builds trust in the processes and tooling.

From time to time, a change comes up that pushes the limits of how much we trust those systems. We've built these tools and processes as a safety net, but no safety net is foolproof. That's where feature flags come in.

Feature flagging, at its root, gives us an opportunity to ship and deploy code without releasing it. We can turn the flag on for some testing audience in production, or even in pre-production, and close the confidence gap before we roll the flag out to the rest of our users.
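
As a concrete illustration, here is a minimal sketch of that gating. The flag name, the isFlagEnabled helper, and the user shape are all hypothetical stand-ins, not any particular vendor's API:

    interface User {
      id: string;
      groups: string[];
    }

    // Stand-in for a real feature-flag service: enable "new-checkout" only for
    // an internal testing audience until we've closed the confidence gap.
    function isFlagEnabled(flag: string, user: User): boolean {
      if (flag === "new-checkout") {
        return user.groups.includes("internal-testers");
      }
      return false;
    }

    function renderCurrentCheckout(user: User): string {
      return `current checkout for ${user.id}`;
    }

    function renderNewCheckout(user: User): string {
      return `new checkout for ${user.id}`;
    }

    // The new code path is deployed everywhere, but only "released" to the
    // users for whom the flag is on.
    function renderCheckout(user: User): string {
      return isFlagEnabled("new-checkout", user)
        ? renderNewCheckout(user)
        : renderCurrentCheckout(user);
    }

Rolling the flag out to everyone, and later deleting it, then becomes a small, low-risk change of its own.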

Experiments, big features, and fear

Two examples are hiding in plain sight: customer-facing experiments or A/B tests, and large features.

In the case of a customer-facing A/B test, since we don't know the winning outcome, none of our automated testing will be able to ensure our code is doing the "correct" thing. "Correct" is up in the air! So we add a type of feature flag, and measure. That is how we establish confidence that we're going down the right path.
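
To make that shape concrete, here is a small sketch of an experiment-style flag. The deterministic bucketing and the exposure event are assumptions about how assignment and measurement might work, not a specific tool's API:

    // Sketch of an A/B-test style flag: assignment is deterministic per user,
    // and "correct" is decided by measurement rather than by automated tests.
    function assignVariant(experiment: string, userId: string): "control" | "treatment" {
      // Simple deterministic hash so the same user always sees the same variant.
      let hash = 0;
      for (const ch of `${experiment}:${userId}`) {
        hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
      }
      return hash % 2 === 0 ? "control" : "treatment";
    }

    function recordExposure(experiment: string, userId: string, variant: string): void {
      // Stand-in for whatever analytics pipeline you measure outcomes with.
      console.log(JSON.stringify({ event: "exposure", experiment, userId, variant }));
    }

    const variant = assignVariant("checkout-copy-test", "user-123");
    recordExposure("checkout-copy-test", "user-123", variant);

Deterministic assignment keeps each user in the same variant across visits, which is what lets the measurement answer the "correct" question.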

When we're planning a large feature, we usually recognize the need for a feature flag because the work isn't going to be done in a single atomic change or ticket. The feature will require time to build out, and it will require more exploratory testing (the best kind of human testing). We would not have the confidence to land even a medium-sized feature all at once, so we reduce the risk by decomposing it into a series of changes. In fact, we're confident that each individual change is not correct on its own, since each one only contributes a small piece of the complete behavior.

(I have seen several teams break a feature down into smaller pieces in individual tickets, and then try to have a person acceptance-test every piece. This has never made a lot of sense to me, for two reasons. First, when engineering slices a large feature into smaller parts and intermediate goals, those parts aren't always testable by a person at all: either everything breaks if you turn on the flag too early (which is fine, as long as it breaks in the right way), or the changes are purely internal. Second, this almost always requires the engineer to remind the person testing the change about what isn't included yet. How can a designer assess the correctness of a design if only a few elements are pulled together? Or a product manager assess the correctness of behaviors if the things that make use of those behaviors don't exist yet? But this is its own post.)

The last major case is, frankly, vibes. I don't know all the fragile parts of your systems, but your team does. Is someone nervous about the change? Is it a reasonable fear? Better safe than sorry: feature flag it. If the engineer who has been around for a while thinks touching this part is likely to be dangerous, then put a feature flag in front of it.

So, a framework?

When the question "should we feature flag this?" comes up, this is how I reframe it:

Is this change risky in a way that our normal processes (development, review, CI, etc.) don't account for, leaving us without confidence that we're shipping correctly behaving software?

Often the answer is yes (a hint is that someone asked at all), and that's OK. If your team is getting used to CD, before that trust is built up, you'll err on the side of creating more flags. That's good practice!

If you've been practicing CD for a while and find yourself answering "yes" more often than you'd like, there are some follow-up questions:

  • Why do these changes make us more nervous than normal?
  • Do these cases have anything in common? Is there a particular part of the code base that makes us nervous?
  • How can we augment our usual process so that we can improve our confidence? Can we improve automated test coverage?
  • Is creating "too many" feature flags actually a problem?

If you find a specific area of the code base to be more risky than others, that is a good place to focus during sustaining engineering work. If the risk is concentrated in one layer, like database access or UI design, it may be worth investing in new or improved testing or monitoring infrastructure—or it may be that changes in certain layers really should be behind feature flags most of the time to allow human testing in production.

The common factor might also be a person. This could be someone who tends toward nervousness, or has been burned by gaps in the automated infrastructure recently or badly. That person may personally have lower confidence in the systems overall. It could also be someone whose work doesn't, well, inspire confidence, to put it delicately. One way or another, addressing this probably requires some one-on-one conversations.

If you are creating a lot of feature flags and feel like it's "too many," dig into that feeling. If the feature flags lead to code that is so complex, or demand so much time spent on feature flag clean-up (which you should do!), that your team's ability to deliver value is impaired, then it may indeed be too many. Look for ways to reduce the number of feature flags, possibly by combining them or by improving the development and deployment processes.

On the other hand, if your team creates and deletes a large number of feature flags and it is not preventing you from delivering value, then that isn't a problem. That's the process doing its job and working as intended.