Yesterday, I wrote about how a three-legged deploy process enables migrations in continuous deployment (CD). Today I want to talk about the creativity that hides in the gap between "a little" and "zero."
One way to look for opportunities to innovate (I think attributable to Marty Cagan) is to take some metric and ask what it would take to change it by a factor of 10. In engineering, this often looks like dividing by 10: how can we reduce the time or effort in this process by an order of magnitude?
But dividing by 10, no matter how many times you do it, will never get you to zero. Zero takes some kind of leap. What does it take, not to improve or optimize a process, but to completely eliminate it?
Years ago, I had a process for canceling recurring credit card subscriptions that involved SSHing into a running machine, opening a Python interpreter, and running several commands—to get the right user, to check the state of their subscription, to execute the business logic to cancel it, etc. All of these steps were opportunities for error, and they took at least a few minutes. Eventually I did manage to encapsulate it so it was fewer commands, but the risk was still there.
Then I added it to the admin UI. All of a sudden I didn't have to drop into "engineer mode" from "customer service mode" anymore. I could stay focused on churning through customer service requests—and I started dreading them less.
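To make the encapsulation step concrete, here is a minimal sketch of collapsing the lookup, state check, and cancellation into one function. It is not the original code: `Subscription`, `CancelError`, `cancel_subscription`, and the dict standing in for the data store are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    user_email: str
    status: str = "active"  # hypothetical states: "active" or "canceled"


class CancelError(Exception):
    """Raised when a cancellation cannot proceed safely."""


def cancel_subscription(subscriptions, email):
    """Look up, validate, and cancel a subscription in one step.

    `subscriptions` is a hypothetical dict mapping email -> Subscription,
    standing in for whatever data store a real system would use.
    """
    sub = subscriptions.get(email)
    if sub is None:
        raise CancelError(f"no subscription found for {email}")
    if sub.status != "active":
        raise CancelError(f"subscription for {email} is already {sub.status}")
    # Real business logic (for example, telling the payment processor to
    # stop billing) would go here; this sketch only flips the local state.
    sub.status = "canceled"
    return sub
```

A function like this can be called from an interpreter, but it can just as easily sit behind a button in an admin UI, which is what removes the hand-typed steps altogether.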
Teams see even bigger benefits from changes like this. On one of my teams, a process required a non-technical team member to raise a request, an engineer to pick up that request, and that engineer to run potentially destructive code by hand. Eventually we asked, "How can we take ourselves out of this entirely?" The answer was to add UI that enabled the non-technical team members to handle the task themselves. Not only did it reduce the risk and the need for access to production, it also sped up the overall process by at least an order of magnitude. What used to take days, most of it inactive time, could now be done in one step while on the phone with a customer.
Downtime is similar. Engineers will look for ways to minimize disruption. You'll hear things like "these need to go out at the same time" or "it shouldn't take more than a minute." Minimization techniques won't get this to zero downtime—though they might get you close enough for your needs—but flipping the question around will.
When I ran TodaysMeet, a live chat application, I wanted to deploy during the day, while I was awake. But deploys would cause customers to experience errors: usually only a few, but still not a good user experience. The solution had three parts: turning one layer of the architecture into a mesh, routing requests from local processes to neighboring copies; building in queues that could buffer, and later drain, requests if the next layer was unavailable; and adding some optimistic UI updates.
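The buffer-and-drain idea can be sketched in a few lines. This is an illustration under my own assumptions, not TodaysMeet's actual code: `BufferingForwarder` and the `send` callable are hypothetical, and the next layer is assumed to signal unavailability by raising `ConnectionError`.

```python
from collections import deque


class BufferingForwarder:
    """Forward requests downstream, buffering them while it is unavailable."""

    def __init__(self, send):
        # `send` is a hypothetical callable that delivers one request to the
        # next layer and raises ConnectionError when that layer is down.
        self._send = send
        self._pending = deque()

    def forward(self, request):
        """Queue the request, then try to deliver everything in order."""
        self._pending.append(request)
        return self.drain()

    def drain(self):
        """Deliver buffered requests oldest-first; stop at the first failure.

        Returns the number of requests delivered by this call.
        """
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # next layer still down; keep the request buffered
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def backlog(self):
        """Number of requests still waiting for the next layer."""
        return len(self._pending)
```

While the next layer restarts, calls to `forward` simply grow the backlog instead of failing. Once it comes back, the next `forward` or `drain` call delivers everything in its original order. A real deployment would also want a timer that calls `drain` periodically and a cap on the backlog size.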
Making the downtime window shorter would have reduced the likelihood of requests failing, but eliminating it completely provided a much better user experience, and made me feel comfortable deploying whenever, even during peak traffic.
When there's a common task, something that takes a large percentage of your time, or a disruption, instead of asking how to minimize it, ask, "What would it take to eliminate this?"