I am no fan of "Root Cause Analysis." It's one of the many terms or concepts that software has borrowed from manufacturing—while ignoring the subsequent 70 years of research into human factors and resilience engineering.
However, like Lorin, I admit that it has some value during incident response. It's jargon, or an "improper noun": it means something more than is contained in the colloquial words. Typically, we're looking for the proximal cause. What's the most recent thing that changed? What specific input violated some constraint?
So I very much appreciated this post by Casey Rosenthal, which I read this morning after having spent a few hours fighting production fires with several on-call engineers yesterday (and more than a few who weren't, but jumped in anyway). In particular, the term "Least Effort to Remediate," or "LER," stuck with me.
When we're looking for the "root cause" during an incident, we're really looking for the Least Effort to Remediate. Undoing the proximal cause (e.g., the configuration that changed, the new host in a Kubernetes node pool, the recent deploy) is typically the LER. Not always, but often enough that it's a pretty good place to start.
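To make that starting point concrete, here is a rough sketch of the heuristic in Python. Everything in it is hypothetical and mine, not Casey's: the `Change` record, the `revert_minutes` effort estimate, the two-hour lookback window, and the sample data are assumptions chosen only for illustration. It lists the changes that landed shortly before the incident and orders them by how cheap they are to undo.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """A hypothetical record of something that changed in production."""
    description: str
    applied_at: datetime
    revert_minutes: int  # rough guess at the effort needed to undo it


def remediation_candidates(changes, incident_start, window=timedelta(hours=2)):
    """Return changes applied within `window` before the incident,
    cheapest to undo first; ties go to the most recent change."""
    recent = [
        c for c in changes
        if incident_start - window <= c.applied_at <= incident_start
    ]
    return sorted(
        recent,
        key=lambda c: (c.revert_minutes, incident_start - c.applied_at),
    )


# Made-up example data, for illustration only.
incident_start = datetime(2024, 1, 1, 12, 0)
changes = [
    Change("deploy of web service", datetime(2024, 1, 1, 11, 40), 5),
    Change("config flag flipped", datetime(2024, 1, 1, 11, 55), 1),
    Change("new host added to node pool", datetime(2024, 1, 1, 11, 10), 15),
    Change("database schema migration", datetime(2023, 12, 29, 9, 0), 120),
]

candidates = remediation_candidates(changes, incident_start)
for c in candidates:
    print(f"{c.revert_minutes:>3} min  {c.description}")
```

The ordering is only a place to start, not a verdict: the cheapest recent change to undo is not always the one that matters, which is the "not always, but often enough" caveat above.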
What I especially like about LER versus the jargon version of "root cause" (setting aside that "root cause" has to be actively confined to this context) is that "cause" is backward-looking and often inherently linked to blame: "What just happened? Did anyone do anything?" LER, however, is more forward-looking: "What can we do right now?"