Advanced Blamelessness is Owning Mistakes without Fear
Blame happens when we stop our investigation with operator error.
Then James restarted the server, which broke everything.
The first step towards a blameless culture is to stop blaming people for things going wrong. (Assuming no malicious actors.)
Then I restarted the server, which I should've known was going to break everything. My bad.
The second step is to stop blaming yourself. This is much harder.
The server was restarted, which flushed the caches and led to a thundering herd...
The third step is getting to a place where you can comfortably assign ownership to actions without fear of it being a bad thing.
James restarted the server, which broke everything.
Why did James do that? What led him to believe it was the right solution? What signals would have pointed him in a better direction? How are the caches set up? Can the thundering herd be addressed?
Operator error is something you can and should examine and address. As long as it's the starting point of the questions, not the end, it should be OK to own up to making the error.
This is hard. It's hard to accept that we make mistakes, which we do, at a fairly predictable rate, because we're human. It's even harder to accept that we will keep making them. Once we can accept that, however, it becomes obvious that we need to build systems (software, processes, whatever) that assume operator errors will happen. If we assume operator errors are an immutable reality, then the gaps to close can't be with the operator; they have to be in the system.