Sliding Windows of Compatibility
Writing software professionally means constantly changing, evolving, and migrating applications. All-new "greenfield" projects are a rare treat, not the norm. Constant change means we have to be comfortable making big changes to applications that are running in production environments, usually with customers using them, and doing it safely, without downtime or incidents.
The bigger the change, though, the more risk—of downtime, of disruption to team members, of getting shallow code reviews, of needing to roll back and fix something, somewhere.
So one of the ways we can make big changes safer is to break them down into a set of smaller changes, each of which is compatible with the ones around it, but not necessarily the ones further in the past or future.
What does this look like in practice?
Here's an example. Let's say we have a logging API, but it's a simple one, something like:
export const log = (message: string): void => {
  console.log(JSON.stringify({
    time: new Date().toISOString(),
    message
  }));
};
And we want to add a log level parameter. We want every call site to set this explicitly. For the sake of the example, we'd prefer the level to be the first parameter. Eventually, we want the signature to look like:
export const DEBUG = "DEBUG";
export const INFO = "INFO";
export const ERROR = "ERROR";
export type Level = typeof DEBUG | typeof INFO | typeof ERROR;
export const log = (level: Level, message: string): void => {
There are a couple of approaches we could take. One is to change the method and all of its call sites in a single, possibly large commit. The other is to spread this out using a three-legged deploy (support the new thing, migrate to it, then remove the old thing), possibly with a stretched-out second leg.
Big changes are riskier for a few reasons:
- The more changes at once, the harder it is to attribute any issues to the right part of the change.
- Large, single changes can't be partially rolled back—it's either undo the whole thing or fix forward.
- It's harder for a human to understand the full scope of the change, making it more likely that something is missed.
- Code reviewers have a harder time engaging with the bigger change, leading to more perfunctory, "rubber-stamp" reviews, and less knowledge shared across the team.
- Large changes—particularly those that touch a lot of files—are more likely to cause merge conflicts and rework for teammates.
Instead, we can decompose this change into a three-legged deploy. If we require the first leg (support the new thing) to be backwards compatible, the first change might look like this instead:
export const log = (level: Level | string, message?: string): void => {
  // If we only get one argument, assume that argument is the
  // log message and set a default log level. Checking for undefined
  // (rather than falsiness) keeps log(INFO, "") on the new-style path.
  if (message === undefined) {
    message = level as string;
    level = INFO;
  }
  console.log({
This isn't our ideal state: it introduces complexity into the code, means we need to test more code paths, and could be confusing for developers since there are two ways to call log(). But it gives us an intermediate state that doesn't require us to change everything at once, or block any work already in progress by our team.
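One way to make the two call styles explicit during the first leg is with TypeScript function overloads. The following is only a sketch under some assumptions: the `LogEntry` type and the `toEntry` helper are hypothetical names introduced here so the behavior can be checked without reading console output, and marking the one-argument overload with a `@deprecated` JSDoc tag is one option among several.

```typescript
export const DEBUG = "DEBUG";
export const INFO = "INFO";
export const ERROR = "ERROR";
export type Level = typeof DEBUG | typeof INFO | typeof ERROR;

// Hypothetical shape of a log entry (not part of the original API)
interface LogEntry {
  time: string;
  level: Level;
  message: string;
}

// Hypothetical helper: normalizes old- and new-style arguments into one entry
export function toEntry(levelOrMessage: string, message?: string): LogEntry {
  const time = new Date().toISOString();
  if (message === undefined) {
    // Old style: the only argument is the message, so default the level
    return { time, level: INFO, message: levelOrMessage };
  }
  // New style: the overloads below guarantee the first argument is a Level
  return { time, level: levelOrMessage as Level, message };
}

// New style: explicit level first
export function log(level: Level, message: string): void;
/** @deprecated Pass an explicit level as the first argument. */
export function log(message: string): void;
export function log(levelOrMessage: string, message?: string): void {
  console.log(JSON.stringify(toEntry(levelOrMessage, message)));
}

log("server started");      // old style, logged at INFO
log(ERROR, "disk is full"); // new style
```

Compared with a `Level | string` first parameter, the overloads let the compiler reject a two-argument call whose first argument isn't a valid level, while still accepting old-style calls until the third leg.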
If there are a lot of call sites, which is likely with something like logging, we can start updating them in groups. We can pick the areas of the code base to focus on by avoiding where our coworkers are working, reducing the risk of merge conflicts and rework.
To ensure we actually finish this piecemeal change, we can instrument our fallback code, using a deprecation warning or counter metric:
  if (message === undefined) {
    message = level as string;
    level = INFO;
    // Warn outside production so remaining old-style calls stay visible
    if (process.env.NODE_ENV !== 'production') {
      console.warn('deprecated: log called without log level. add an explicit level argument');
    }
  }
As we update the callers, we'll have fewer and fewer of these deprecation warnings, until we're not seeing any. Each change is also compatible with the one before and after: the log() API still supports both old- and new-style calls, and we have old- and new-style calls in the code base. Until we don't.
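The counter-metric option could look something like the sketch below. It assumes a plain in-memory counter; `deprecatedLogCallCount` is a hypothetical name, and a real application would more likely report to whatever metrics system it already uses.

```typescript
export const DEBUG = "DEBUG";
export const INFO = "INFO";
export const ERROR = "ERROR";
export type Level = typeof DEBUG | typeof INFO | typeof ERROR;

// Hypothetical in-memory counter of old-style calls seen by this process
let deprecatedLogCalls = 0;
export const deprecatedLogCallCount = (): number => deprecatedLogCalls;

export const log = (level: Level | string, message?: string): void => {
  if (message === undefined) {
    // Old-style call: treat the only argument as the message
    message = level;
    level = INFO;
    // Count it, so we can watch the number fall toward zero
    deprecatedLogCalls += 1;
  }
  console.log(JSON.stringify({
    time: new Date().toISOString(),
    level,
    message
  }));
};
```

When the counter stays at zero across a full deploy cycle, that's a reasonable signal that the second leg is done and the fallback can be removed.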
Once we've changed the last log() calls, we only have new-style, two-argument calls. Since this is an application and not a library (where we might use semantic versioning to communicate the breaking change) it's no longer "breaking" to remove support for the old-style calls: we're removing completely unused code paths. That state of the code means that our new function signature is compatible with the changes immediately around it:
- Commit 1: introduce backwards-compatible change to log() method. All calls are old-style. (First leg.)
- Commit 2: Change first group of log() calls. (Second leg.)
- Commit ...: Keep changing groups of log() calls. (More second leg.)
- Commit n-1: Change last group of log() calls to new-style. (Still more second leg.)
- Commit n: Remove support for old-style. (Third leg.)
The last commit, n (it's old math habits that make me refer to it as n), is not compatible with commit 1, but that's OK. It's compatible with commit n-1. As of commit n, we've got our new, desired logging API, and no individual change in the process had to be large or breaking.
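For illustration, commit n might leave us with something like the following. This is a sketch, not a definitive implementation: splitting out a `formatEntry` helper that takes the timestamp as a parameter is an assumption made here so the output is easy to check.

```typescript
export const DEBUG = "DEBUG";
export const INFO = "INFO";
export const ERROR = "ERROR";
export type Level = typeof DEBUG | typeof INFO | typeof ERROR;

// Hypothetical helper: builds the JSON line for a given timestamp
export const formatEntry = (level: Level, message: string, now: Date): string =>
  JSON.stringify({
    time: now.toISOString(),
    level,
    message
  });

// Third leg: both arguments are required and the fallback branch is gone
export const log = (level: Level, message: string): void => {
  console.log(formatEntry(level, message, new Date()));
};

log(INFO, "migration complete");
```

With the optional parameter gone, any old-style call that slipped through fails to compile, because `message` is now a required argument.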
Why bother?
One reason to avoid large, sweeping changes is that doing so makes the migration easier on our teammates.
If we made the change as a single, breaking change, any existing branches created from the pre-change version would need to update all of their log() calls after the post-change version is merged. By supporting both types of calls for a while, we allow our colleagues to finish any work in progress without needing to revisit their calls to log(), while also allowing them to start using the new style sooner, since they don't have to wait for us to complete the big change. That even means they can help by updating calls as they are working on other tasks!
In the example, we can have other commits or changes that happen concurrently with our migration:
- Commit 1, the first leg.
- Commit A, something completely unrelated, using either the old or (preferably) new style.
- Commit 2, updating a group of calls.
- Commit B, a different change, using the new style.
- Commit 3, more log() updates.
- Commit D, a coworker comes back from vacation and merges a change from two weeks ago, before commit 1, so it uses the old style. (Not great, but it lets them ship and do a follow-up change, rather than blocking them.)
- Etc...
Like everything, using this sliding-windows approach involves making trade-offs. It needs communication or tooling to keep people from adding more old-style calls, or it can start to feel Sisyphean. It requires team-level discipline to complete the change and not leave the code in a mixed state. It might take more time—certainly more code-review cycles. On the other hand, each of those review cycles is smaller, faster, and easier to understand. The risk of any individual change is much lower, partly because reviewers are less likely to miss things, so we're less likely to have an incident—and if there is an issue from any one step, it's easier to find the culprit. It also means less disruption to the other members of the team, whether that's merge conflicts or reworking something that's in progress.
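As one possible form of that tooling, a CI step could act as a ratchet: count the old-style calls and fail if the number goes up. The sketch below is a rough heuristic under stated assumptions. `countOldStyleCalls` and `ratchetHolds` are hypothetical names, and the regular expression only recognizes calls with a single quoted string literal, so it misses template literals and variables and would also match something like `console.log("x")`.

```typescript
// Rough heuristic: log("...") or log('...') with exactly one string literal argument
const OLD_STYLE_CALL = /\blog\(\s*(["'])(?:\\.|(?!\1)[^\\])*\1\s*\)/g;

// Hypothetical helper: counts apparent old-style calls in a source string
export const countOldStyleCalls = (source: string): number => {
  const matches = source.match(OLD_STYLE_CALL);
  return matches === null ? 0 : matches.length;
};

// Hypothetical helper: the ratchet holds as long as the count never increases
export const ratchetHolds = (baseline: number, current: number): boolean =>
  current <= baseline;

const before = 'log("starting"); log(INFO, "ready");';
const after = 'log("starting"); log(INFO, "ready"); log("oops");';
console.log(ratchetHolds(countOldStyleCalls(before), countOldStyleCalls(after)));
```

A type-aware lint rule would be more precise, but even a crude count like this can keep the number of old-style calls moving in one direction.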
With less disruption and risk, while it may take longer to complete the three-legged deploy, it may not take more of the team's time, overall.
- Working in smaller chunks usually means less cumulative time spent in code reviews, and faster turnaround time since the review doesn't take as much dedicated time.
- By picking non-overlapping groups of call sites to update, people working on the migration can spend more time on task and less with branch management.
- We can also work around other changes that might be happening concurrently to avoid merge conflicts.
- Letting teammates opt into the new style earlier means less total rework.
- And the work can be parallelized—across multiple people working on the migration, by other team members updating as they go, or both!