Single Points of Failure and the Human Element

When analyzing how technology and systems can fail, it is common to look for Single Points of Failure (SPFs). An SPF is a weak link whose failure can impair or halt the rest of the system. SPFs can take many forms, such as a single power source supplying multiple pieces of equipment. When systems are tightly coupled, they depend directly on each other. For example, a generator that powers lights and machinery or tools directly affects their performance: the lights and tools depend on the generator and are coupled to it. If the generator fails and there is no backup, such as automatic or manual switching to a separate generator or another power source, the lights will go out and the tools will stop operating. In loosely coupled systems there are ways to continue operations despite the generator failure, and these alternatives should be designed into the system.
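For readers who keep an equipment inventory, the generator example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the load names, the source names, and the function `single_points_of_failure` are hypothetical, and a real assessment would draw on an actual hazard analysis rather than a hand-typed map.

```python
# Hypothetical inventory: each load mapped to the power sources able to supply it.
supplies = {
    "lights": {"generator_a"},
    "tools": {"generator_a"},
    "radio": {"generator_a", "battery_bank"},
}


def single_points_of_failure(supplies):
    """Return {source: [loads with no alternative source]}."""
    spfs = {}
    for load, sources in supplies.items():
        if len(sources) == 1:
            # Exactly one source: losing it takes this load down.
            (only_source,) = sources
            spfs.setdefault(only_source, []).append(load)
    return spfs


print(single_points_of_failure(supplies))
# {'generator_a': ['lights', 'tools']}
```

The radio does not appear in the result because it has a second source, which is the loosely coupled case described above.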

In some cases SPFs are acceptable. If a task is neither mission-critical nor safety-critical, an SPF may not be a big deal: if the component fails, we replace it and life goes on. On the other hand, if an organization operates mission-critical processes or relies on safety-critical functions, an SPF at those points could lead to serious operational consequences in the first case and perhaps catastrophic consequences in the second. So, as part of a robust hazard analysis and risk assessment process, it is wise to assess systems for SPFs. But are SPFs limited to technology and equipment?

If we look hard enough at our organizations, we can often find Human Single Points of Failure. I call these H-SPFs. Organizations usually look at Single Points of Failure from a technical standpoint, where the failure of one component or subsystem takes out numerous other connected systems, which could cause problems for the workers who rely on them. This is a basic concept, but it is easy to miss. Managers often fall back on the reliability of a single component or subsystem and hope it will not fail, as opposed to planning for potential failure, building in points that decouple one failure from the rest of the system, and leaving themselves "an out."

If this is hard to plan for with technology, consider how hard it can be from a human standpoint. How many organizations rely on "heroes" to get the job done? Rather than building redundancy and resilience into a team through a structured staffing process, organizations often place a single set of functions on the shoulders of one employee. If that employee becomes sick or leaves for another job, the department may be left scrambling to figure out how to do that one employee's job. This is an H-SPF. We need to build redundancy into staffing in order to build resilience, yet some organizations are staffed so leanly that this can be difficult. Job sharing and cross-functional teams may help. The idea could be sold to leadership by demonstrating the consequences if the employee is out for even a day, and simulations could be conducted to show the difficulties. For example, on a Monday morning a drill could be run in which all the employees in H-SPF jobs are told not to perform their jobs, and the department heads then have to work out what to do. This might be an effective way to gain leadership buy-in. Whatever strategy is used, a comprehensive approach that examines and corrects both technical and human SPFs may be better than one that focuses on technology alone.
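As a starting point for that conversation, a simple task coverage list can make H-SPFs visible on paper before any drill is run. The sketch below is a hypothetical example: the task names, the employee names, the two-person minimum, and the functions `human_spfs` and `impact_if_absent` are all assumptions made for illustration, not an established method.

```python
# Hypothetical coverage list: each task mapped to the people trained to do it.
coverage = {
    "payroll run": ["Dana"],
    "permit renewals": ["Dana", "Lee"],
    "crane inspection": ["Sam"],
}


def human_spfs(coverage, minimum=2):
    """Return the tasks covered by fewer than `minimum` distinct people."""
    return {
        task: people
        for task, people in coverage.items()
        if len(set(people)) < minimum
    }


def impact_if_absent(coverage, person):
    """Return the tasks nobody else can perform if `person` is out."""
    return sorted(
        task for task, people in coverage.items() if set(people) == {person}
    )


print(human_spfs(coverage))
# {'payroll run': ['Dana'], 'crane inspection': ['Sam']}
print(impact_if_absent(coverage, "Dana"))
# ['payroll run']
```

A list like this is no substitute for the drill, but it may help decide which jobs to include in one.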

Thanks for reading, and have a great and safe day!