Zero Downtime and the Art of Continuous Deployment
I've been thinking a lot about the relationship between availability and continuous deployment (when I say "CD," I mean continuous deployment, all the way to production, not just continuous delivery). One of the primary concerns people raise about CD is the perceived risk to availability or uptime. But there are patterns we can adopt that not only minimize this risk but actually improve the availability of the system, because we're being deliberate about it.
Three-Legged Deploys
The general pattern is:
- Support the new thing.
- Migrate to the new thing.
- Remove the old thing.
This looks simple and, conceptually, it is—though the details and specific cases can be much more complex. It's such a low-level pattern that I haven't seen a name for it, so I'll use "Three-Legged Deploy."
Example: switching to a new API
For a straightforward application of this pattern, let's consider a consumer application that calls an internal API. For whatever reason, we're unhappy with the current internal API design and plan to move to a v2.
The first step is to implement the new thing, in this case the v2 API, even though nothing is calling it yet. Not necessarily the whole thing, but let's say we want to start by moving GET /get_widgets to GET /v2/widgets. So, in our widgets service, we'll implement this new API endpoint.
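To sketch what that first leg might look like in a Go service using just the standard library (the Widget type, listWidgets helper, and handler names are illustrative stand-ins, not our real code):
package main

import (
	"encoding/json"
	"net/http"
)

// Widget stands in for whatever the real response shape is (illustrative).
type Widget struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// listWidgets stands in for the service's existing data access (illustrative).
func listWidgets() ([]Widget, error) {
	return []Widget{{ID: "w-1", Name: "example"}}, nil
}

// handleGetWidgetsV2 serves the new endpoint. Nothing calls it yet, which is
// what makes this first leg safe to deploy.
func handleGetWidgetsV2(w http.ResponseWriter, r *http.Request) {
	widgets, err := listWidgets()
	if err != nil {
		http.Error(w, "failed to list widgets", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(widgets)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/get_widgets", handleGetWidgetsV1) // old route, untouched
	mux.HandleFunc("/v2/widgets", handleGetWidgetsV2)  // new route, no callers yet
	http.ListenAndServe(":8080", mux)
}

// handleGetWidgetsV1 represents the existing endpoint; its behavior doesn't change.
func handleGetWidgetsV1(w http.ResponseWriter, r *http.Request) {
	// ... existing v1 behavior
}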
Now that we've got our new endpoint, we can do whatever amount of testing we need without putting our consumer application's traffic at risk.
When we're confident in the new endpoint, we move on to step 2: migrate to the new thing. In our consumer application, we can add a feature flag:
var widgets []models.Widget
if flags.Enabled(request, "release_widgets_v2_get") {
	// New path: call the v2 endpoint.
	resp, err := http.Get("https://widgets.svc.local/v2/widgets")
	if err != nil {
		// ... handle the request error
	}
	widgets, err = parseResp(resp)
	// ... etc
} else {
	// Old path: keep calling the existing, known-good endpoint.
	resp, err := http.Get("https://widgets.svc.local/get_widgets")
	if err != nil {
		// ... handle the request error
	}
	widgets, err = oldParseResp(resp)
	// ... etc
}
Since we're not trying to change the observable behavior, we might roll this out to a slowly increasing percentage of randomly selected traffic, to make sure the new path handles the realities of production.
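I won't pretend this is how any particular flag system works, but a percentage rollout might be as simple as deterministic bucketing by user and flag name, so a given user keeps seeing the same branch as the percentage ramps up (the X-User-ID header and the lookupRolloutPercent helper are assumptions):
package flags

import (
	"hash/fnv"
	"net/http"
)

// lookupRolloutPercent stands in for wherever the rollout percentage lives
// (config, database, a flag service); hypothetical for this sketch.
func lookupRolloutPercent(flag string) uint32 {
	return 10 // e.g. currently rolled out to 10% of traffic
}

// Enabled buckets each request deterministically by user and flag name, so a
// given user sees consistent behavior as the percentage ramps toward 100%.
func Enabled(r *http.Request, flag string) bool {
	h := fnv.New32a()
	h.Write([]byte(flag + ":" + r.Header.Get("X-User-ID")))
	return h.Sum32()%100 < lookupRolloutPercent(flag)
}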
Sooner rather than later—hopefully!—we'll be fully rolled out, and then we can move to the third step: removing the old thing. In this case there are two places we need to clean up. We will remove the flag from our consumer application and then remove the old API route from our widgets service.
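Once the flag has sat at 100% for a while, the consumer-side cleanup is mostly deletion; continuing the earlier sketch, what's left is just:
// After cleanup: the flag check and the old branch are gone.
var widgets []models.Widget
resp, err := http.Get("https://widgets.svc.local/v2/widgets")
if err != nil {
	// ... handle the request error
}
widgets, err = parseResp(resp)
// ... etc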
A more complex example: changing how data is stored
In the previous example, "migrate" could have been replaced by "use." What happens if we actually need to migrate something?
Let's say we've built a storefront with customer ratings. In the early days, keeping customer ratings in a ratings table was the simple and fast option. Now, as our traffic has increased, we've noticed that keeping everything in the same database is becoming a bottleneck—we can only scale that DB up so far—and so we're going to move the ratings into a new service to help distribute that load horizontally.
For our first step, we support the new thing: build and test a ratings service.

But wait, our new ratings service doesn't have any data in it! We've got years of data to move, which could take hours in a straightforward migration. During that time, where would we store new ratings? The easiest answer might be to prevent customers from submitting new ratings, but we can do better by aiming for zero downtime.
Instead, we might do an online migration to the new thing. Right now, every new rating created adds to our backlog of data to migrate. To stop the situation from getting worse, in our web app, we might have code like:
def create_rating
  # Write to the new ratings service...
  ratings_service.create_rating(product: product.sku, stars: stars)
  # ...and keep writing to the old table until the migration is done.
  CustomerRatings.create!(product_id: product.id, stars: stars)
end
Writing new ratings to both places ("double-writing") means that we won't have to migrate any new data.
For the existing data, one option is to migrate it on-demand, when we encounter it:
def get_ratings
  ratings = ratings_service.get_ratings(product: product.sku)
  old_ratings = CustomerRatings.where(product_id: product.id)
  if ratings.length < old_ratings.length
    # The new service is missing data for this product: backfill it from the
    # old table, and serve this request from the old table.
    ratings_service.create_ratings(product: product.sku, ratings: old_ratings)
    ratings = old_ratings
  end
  ratings
end
Migrating the entire ratings table might take a long time, but migrating the ratings for a single product at a time may only take a few dozen milliseconds. (I'm leaving out protections for the "thundering herd" problem for now.) There is a temporary increase in latency for a small number of requests, but then future requests for that product will use the ratings service.
If we're not happy with the potential increase in latency, we can also handle the migration step asynchronously, likely in batch jobs or by product.
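As a sketch of the by-product variant, the read path could enqueue a background job instead of copying inline, and serve the old table's data for this request (MigrateProductRatingsJob is hypothetical, and perform_later assumes a job framework like ActiveJob):
def get_ratings
  ratings = ratings_service.get_ratings(product: product.sku)
  old_ratings = CustomerRatings.where(product_id: product.id)
  if ratings.length < old_ratings.length
    # Copy this product's ratings in the background instead of paying the
    # latency on this request; serve the old table's data in the meantime.
    MigrateProductRatingsJob.perform_later(product.id)
    ratings = old_ratings
  end
  ratings
end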
This online migration can work great for popular products, but it might leave a long tail of unmigrated data. We can schedule jobs to backfill the remaining ratings in batches, at a pace that never puts too much load on the servers.
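A sketch of that backfill, assuming a Rails-style job framework; the batch size, the sleep-based pacing, the product association, and the migrated bookkeeping column are all assumptions to adapt to your own setup:
class BackfillRatingsJob < ApplicationJob
  BATCH_SIZE = 500

  def perform
    # Only copy ratings that haven't been migrated yet (assumes a bookkeeping column).
    CustomerRatings.where(migrated: false).in_batches(of: BATCH_SIZE) do |batch|
      batch.each do |rating|
        ratings_service.create_rating(product: rating.product.sku, stars: rating.stars)
      end
      batch.update_all(migrated: true)
      sleep 1 # crude pacing so the backfill never competes with production traffic
    end
  end
end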
If ratings are editable, we'll also need to handle update_rating, which is likely to look similar to create_rating.
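For illustration, a hypothetical update_rating under the same double-writing scheme (the service's update API and the rating_id lookup are assumptions):
def update_rating
  # Double-write updates too, so neither store drifts while the migration is in flight.
  ratings_service.update_rating(product: product.sku, rating_id: rating_id, stars: stars)
  CustomerRatings.find(rating_id).update!(stars: stars)
end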
Eventually, once our remaining data is migrated and we're not seeing any more migration actions, we can remove the old thing by dropping fallback reads and the double writes.
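Sketching the end state, the web app's read and write paths collapse back to single calls against the new service:
# After the third leg: only the ratings service is read and written.
def get_ratings
  ratings_service.get_ratings(product: product.sku)
end

def create_rating
  ratings_service.create_rating(product: product.sku, stars: stars)
end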
Making the trade-off
The three-legged deploy usually takes longer: there are more steps, and they're spread out over more time. In exchange, you can avoid downtime, deploy at any time, and reduce the risk at every step. Is that worth it? It depends. (I couldn't call myself a senior engineer without putting that in at least once.)
Defaulting to the three-legged, zero-downtime approach usually means you're estimating the work based on the safest but most expensive option. If the cost is low, as in the "new API" example, that may be fine. If the cost is higher, as in the "data migration" example, you may need to consider a number of different inputs, like:
- What is the cost (in dollars, trust, etc.) of downtime?
- How confident are we in the new thing handling production load?
- Are we going to need to have people working outside of normal hours? Will that introduce risk?
- Is the "with downtime" approach significantly cheaper to implement?
Is it worth it? Only you and your team can make that call.