Building High-Reliability by Giving Ourselves an "Out"

Have you ever experienced an error or failure where, in hindsight, the chain of events seemed obvious, but while it was unfolding you did not recognize what was about to happen? You know the old saying, “hindsight is 20/20.” Of course hindsight appears to be 20/20: we know how the story ended, and it is easy to say we should have known better or should have paid more attention. The reality, though, is that in complex systems (and I would argue most organizations are complex systems) errors, accidents, mishaps, and failures may not be as simple to predict as we would like to imagine. In fact, some would argue that accidents are a normal outcome in complex systems. Normal Accident Theory suggests that accidents are a normal part of complex systems and are often organizational accidents stemming from multiple failures. You may have heard of “black swan” events, where seemingly unknowable risks unfolded to cause major catastrophe. Sure, in hindsight perhaps they were not truly black swans, but for those working within the system and organization, within a specific context, these may have been black swan events because they did not recognize what was coming.

Why do black swan events or unrecognized failures happen? There may be multiple reasons, and rather than asking why they occur, a more useful approach may be to examine the complex organization and how safety or reliability is created. We can start by recognizing that safety (and perhaps reliability) is an emergent property of complex systems and organizations. This means we cannot predict overall failure from the failure of one component; what happens at the individual level may not be a good predictor of what happens at the system or organizational level. If we only look at the failure of one person at his or her job and don’t think about the ripple effects that failure could have, we may miss opportunities to identify mitigation strategies. Additionally, I believe we can never truly predict all types of failure, and there will always be a level of unknown unknown risks (we can’t imagine them, so we can’t mitigate them specifically).

If this is true, should we just throw our hands up and simply wait for failure? Of course not. I believe we can develop management cultures where we seek highly reliable performance and develop actions to help mitigate failure even if we can’t imagine what that failure might be. One approach may be to design more loosely coupled systems, which allow a degree of flexibility and recovery options in the event of failure. Let’s look at a simple example to demonstrate the difference between a tightly coupled system and a loosely coupled system:

  • Tight coupling: A single generator supplies power to two systems. If that one generator fails, both systems fail.
  • Loose coupling: Each of the two systems has its own generator, so if the first generator fails, the second system still has power.

This isn’t to say that both generators won’t fail, but it is a way to design looser coupling so that if a failure occurs, there are backup options. Slack in a system is another example. Rather than following a critical path, where resources are dedicated to completing a project in minimal time with no room for failure, building slack into the system means that if something (such as an unforeseen risk) occurs, there are additional resources to apply to the work so operations can continue.
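If it helps to see the generator example written out, here is a minimal Python sketch of the contrast. The class and function names (Generator, tightly_coupled, loosely_coupled) are purely illustrative assumptions, not a real design; the point is only that the loosely coupled version fails partially rather than completely.

```python
# Illustrative sketch only: names and structure are invented to mirror
# the two-generator example above.

class Generator:
    def __init__(self, name: str, running: bool = True):
        self.name = name
        self.running = running

    def fail(self) -> None:
        self.running = False


def tightly_coupled(shared: Generator) -> dict:
    # Both systems draw from the same generator: one failure drops both.
    return {"system_1": shared.running, "system_2": shared.running}


def loosely_coupled(gen_1: Generator, gen_2: Generator) -> dict:
    # Each system has its own generator, so a single failure is contained.
    return {"system_1": gen_1.running, "system_2": gen_2.running}


if __name__ == "__main__":
    shared = Generator("shared")
    shared.fail()
    print(tightly_coupled(shared))      # {'system_1': False, 'system_2': False}

    g1, g2 = Generator("gen_1"), Generator("gen_2")
    g1.fail()
    print(loosely_coupled(g1, g2))      # {'system_1': False, 'system_2': True}
```

The partial failure in the second case is what gives you an “out”: one system keeps running while you recover the other.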
 
I like to call the resources we use to loosen coupling “safety gates.” From a conceptual standpoint, “safety gates” are our “outs.” They give us an out in case decisions go badly. They also help us avoid irrevocable decisions: decisions we either cannot take back if they are wrong or cannot mitigate after we make them. Sometimes decisions become irrevocable because we have been overly optimistic and have not built in safety nets to help us recover if or when failure occurs. I used to fly a four-engine transport aircraft. There are some scenarios where the aircraft could operate on two engines in an emergency, but under some conditions, such as heavy weight and high altitude, two-engine operation might not be enough to sustain level flight. I remember years ago, during our multi-engine aircraft simulation training, instructors would try to walk us down the path of shutting down two of our four engines to put us in a tough situation. The conditions were set up so that if we shut down two engines on the same wing, we would not have enough power to maintain altitude and would start descending. The simulation instructors would often freeze the flight simulator and ask us to mentally rewind and see whether we could restart the first engine (even if it meant operating it at a reduced power setting) before shutting down the second engine, in order to avoid shutting down two engines and getting into an unrecoverable situation (an irrevocable decision).
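To make the “safety gate” idea a bit more concrete, here is a small, purely illustrative Python sketch of a guard that refuses a step unless a recoverable state remains afterwards. The thresholds, function names, and logic are invented for the sketch and are not actual aircraft procedure; it only echoes the shape of the simulator lesson above.

```python
# Purely illustrative toy logic; numbers and names are invented,
# not real aircraft procedure.

def can_sustain_level_flight(engines_running: int, heavy_and_high: bool) -> bool:
    # Toy rule: two engines may not be enough when heavy at high altitude.
    return engines_running >= 3 or (engines_running == 2 and not heavy_and_high)


def shutdown_with_safety_gate(engines_running: int, heavy_and_high: bool) -> int:
    """Shut down an engine only if a recoverable state remains afterwards."""
    if can_sustain_level_flight(engines_running - 1, heavy_and_high):
        return engines_running - 1  # safe to proceed
    raise RuntimeError("Safety gate: this shutdown would be irrevocable; "
                       "restore the previous engine first.")


if __name__ == "__main__":
    engines = shutdown_with_safety_gate(4, heavy_and_high=True)   # 3 remain
    try:
        engines = shutdown_with_safety_gate(engines, heavy_and_high=True)
    except RuntimeError as gate:
        print(gate)  # the gate stops us before the second, unrecoverable shutdown
```

The gate itself is the “out”: it forces a pause before the decision becomes irrevocable, which is exactly what the simulator instructors were teaching us to do mentally.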
 
Are there situations in your work environment where you or your teams perform critical tasks and where it is possible to make irrevocable, high-consequence decisions? What happens if those decisions turn out to be wrong and failure occurs? Is it possible to conduct a “pre-mortem” meeting with what-if scenarios to talk about worst-case options and the possibility of building in “safety gates” to help prevent failure from escalating? Is there a way to conduct simulations and rehearsals to try out the implementation of the “safety gates”?
 
There is no surefire, clear-cut answer, but I hope this newsletter helps get conversations started so you can identify tightly coupled systems and ways to loosen those couplings. If so, then when error or failure occurs you may be able to stop the failure chain and recover from it early, minimizing damage to the overall system or organization. Additionally, while you may not be able to recognize all types of risk, by developing a management culture of high reliability you may help build a culture where employees and teams seek out information, try to recognize failure early, and build in the capacity to deal with impending failure before it escalates beyond acceptable levels.
 
Here are a few resources you may want to consider reading:
 
“Art of Critical Decision Making” (part of The Great Courses) by Professor Michael Roberto
 
Managing the Unexpected: Resilient Performance in an Age of Uncertainty by Karl Weick and Kathleen Sutcliffe
 
Normal Accidents: Living with High-Risk Technologies by Charles Perrow
 
The Black Swan and Antifragile: Things That Gain from Disorder by Nassim Taleb
 
I hope this newsletter was helpful. If so, I would greatly appreciate it if you would share it with others using the links below. Thanks for reading and I wish you a great, safe, and productive day! 

P.S. I am proud to announce that V-Speed's Crew Resource Management Planning and Execution Toolkit will soon be available for purchase. You may get a preview of the content or pre-order a copy here. This guide was written to serve as a sort of "field manual" to help organizations implement some of the concepts from my book Team Leadership in High-Hazard Environments: Performance, Safety and Risk Management Strategies for Operational Teams.


If you want to receive FREE and regular actionable content delivered to your inbox, enter your email address below: