In most organizations, operations are complex processes, with many interconnected parts. We often make linear plans, hoping that things will move smoothly from Point A to Point B, yet when planning moves into execution, we can often find ourselves in difficult situations which rapidly involve non-linear actions. This makes decision-making more difficult. I think that DevOps and IT Operations teams may find this familiar, just like I did when I was a pilot in the Marine Corps…
A little over ten years ago I found myself in a very interesting, yet precarious situation. It was not one that I had anticipated, nor predicted, yet one that required rapid decision-making while eating a “soup sandwich.” A soup sandwich is a term we used in the Marine Corps to describe a really messy situation with no good way to solve it. Just imagine trying to eat a sandwich made out of soup. It is sloppy and messy, but when you’re hungry, you’ll eat it. I was flying the KC-130 Hercules (which is sort of like a giant flying gas station) into Baghdad on a tactical approach profile, which is essentially the process of flying the aircraft in a safe and efficient manner to help avoid the enemy threat. The goal was to get into the airport as quickly as possible using the safest route possible. Essentially, flying from Point A to Point B, which should be simple, right? (See Image below).
Well, it would have been, but I quickly found out how a linear process can rapidly turn into a non-linear mess. While approaching the airport the air traffic controller directed us to enter through a different geographical area than we had originally planned and began vectoring another aircraft towards the airport at the same time. We had no information on the other aircraft, except where we thought it was coming from. The controller did not provide us with landing instructions or clearance. This is when I had to start eating the soup sandwich. (See Image below).
You see, when flying in a combat zone we really strive to stick to our motto “first pass, full stop.” What this means is that we want to nail our landing, just like a pitcher throwing a perfect strike in baseball or a basketball player hitting that perfect 3-point shot. “First pass, full stop” is both a tactically proficient method and a way to show pride in our work as aviators. We want to do our best. Additionally, in a tactical combat situation we try to avoid executing what is called a “go-around,” which is when we overfly the airport low and slow and come back around for a landing. We avoid this because a go-around leaves us excessively exposed to the enemy threat.
But there my crew and I were, with some decisions to make in a matter of minutes. As the aircraft commander I felt the heavy pressure to make the best decision possible, given that there was really no perfect decision. Do I land the plane without permission? Do I perform a go-around? I slowed the plane as much as I could to give the other aircraft a chance to get ahead of us, hoping it would land and taxi clear of the runway. If so, my intent was to land with or without permission. My mission was to safely land the aircraft and I didn’t care what this controller told me. As I slowed the aircraft to what we called Max Effort speed, we saw the other airplane arrive in front of us. We rapidly discussed as a crew what our options were and decided that we would take the approach as low as possible and land the aircraft if the plane ahead of us was clear of the runway. We got to about 10 feet above the runway, and since the other aircraft had only just cleared the runway, we felt it was too dangerous to land. Multiple voices on the aircraft shouted at once “GO AROUND!” I immediately reacted, having executed the procedures more times than I can count in rehearsals/practice flights. It was not optimal from a tactical standpoint, but we had more than one safety consideration at that point and it seemed like the right thing to do. I am still here, writing this post, so I guess it was the right decision, but does the outcome justify the process in all cases? More on that later…
I have reflected on this event over the past several years and realized how much it taught me about the way work is performed in the real world as compared to how it is designed and planned on paper, the amazing power of adaptable teams, and the adaptive capacity of humans and technology resources to adjust under times of stress. But how far can individuals, teams and systems be stretched before they break? Are there resources to help team leaders, system managers, and team members contend with the dynamic reality of work? And while we are on this subject, are the concepts I just described about my aviation experience and the questions posed that much different from situations and questions DevOps teams face on a regular basis? The more I learn about DevOps the more I’m convinced that DevOps teams face many of the same challenges we tackled in USMC aviation while transitioning to more automated and software-intensive aircraft. In the next section I provide some guidelines that I have learned through studying, applying and teaching Crew Resource Management and aspects of system safety. In my opinion the divide between the issues faced by combat aircrew and DevOps teams isn’t really all that big and in reality we can probably learn a lot from each other.
7 Guiding Principles for Successful and Resilient Organizational Performance:
Resilient performance means organizations can identify risks in advance, preempt those risks and/or defuse them in a way that allows them to continue operations, despite disturbances in the system. Just like in military aviation, where we had to deal with risks such as enemy threat, bad weather, terrain, and breakdowns in situational awareness, IT organizations also contend with risks to development and operational performance. Whether the risks are related to delays in software releases or server uptime, DevOps teams would be well served to devise strategies to improve their ability to detect risks and to handle these risks if or when they occur. Here are several guidelines that may be useful to help DevOps teams increase resilience and overall organizational performance.
1. Wipe out the zero defect mentality with regard to human performance. A zero defect mentality is an attitude where people believe workers must never make mistakes. When leaders, managers and coworkers are intolerant of mistakes, this can create an environment of fear and distrust. We need to start the conversation about resilience and improved team performance by wiping out the zero defect mentality. In complex work environments people can and will make mistakes. While we strive to set up rules and heuristics for decision-making, sometimes there is no one-size-fits-all rule and teams have to rapidly make decisions based on the information available and their goal hierarchies.
2. Acknowledge the reality that there is a gap between Work-As-Designed and Work-As-Performed. Systems will often function the way they were designed and if there are system deficiencies, it will often be the human and team that make up for these deficiencies. In Marine Corps aviation we would often self-organize and create techniques to make up for the gaps in deficient software and hardware design. We would then teach these techniques as a means for informal knowledge sharing. This is not unique to USMC aviation and I believe others have stories like this. In fact, Sidney Dekker addressed the need for operational workarounds in his keynote address at the 2014 American Society of Safety Engineers Professional Development Conference in Orlando, Florida. During his presentation he described how workers “finish the design” and make up for the shortcomings designers may not have realized during the system design, construction and deployment process. On page 158 of the Third Edition of The Field Guide to Understanding Human Error he describes how pilots placed a paper cup on the flap handle of a commercial airliner so as to not forget to place the flaps in the correct position.1 Sometimes designers and planners don’t foresee every circumstance where humans may be required to adapt to the operational environment.1 Sure, designers and planners can (and should) attempt to develop a hierarchy of hazard controls to optimize the system for human performance, but in some cases the need for specific controls themselves may not be understood at the time the system is designed or deployed.
Alternatively, they may actually design hazard controls into the system, but those controls may still be bypassed (intentionally or unintentionally) as workers perform their tasks and make what Erik Hollnagel describes as Efficiency-Thoroughness Trade-Offs.2 In some cases workers may even adapt procedures in an attempt to make operations less risky and more effective/efficient, given their perspective and the operational context. This holds true with multiple forms of risk controls and operational performance tools, such as checklists. I believe DevOps teams use their teamwork and creative problem solving skills in much the same manner. By identifying the gaps between system design and how work is actually performed, leaders and managers can find out if the human workarounds may be injecting unintended harm into the development and production process and they may even find that DevOps teams have created a solution that can be implemented across the business to reduce risk, and improve effectiveness or efficiency.
3. Understand that pointing to human error and blaming people for problems doesn’t fix system problems. A lazy investigation process will often point to human error as the cause of a problem or failure. Even if investigation teams have the best intentions, they may simply not understand how to investigate beyond human error. While human error may be a causal factor in the accident or failure chain of events, it is often the proximal cause, occurring at the last point before failure is actually realized. Deeper investigation will often reveal system deficiencies (distal causal factors) that may have made it very difficult for humans to recognize and respond appropriately to early signals of failure. This is sometimes referred to as an error-provocative environment, where system design actually induces people to make mistakes or serves as a precursor to error. In fact, it is often because of (not in spite of) people’s creativity and capability to produce good work that organizations are able to achieve successful performance. Processes are not perfect. People are not perfect. Investigators need to have a degree of empathy when conducting post-mortems and investigations. They also need to conduct After-Action Reviews on successful events to understand what people and systems are doing right and how those processes may be repeated. Finding and rectifying system deficiencies may go a long way in helping people to do their jobs right and in making it harder for them to do their jobs wrong.
4. Don’t base success simply on outcomes because the end doesn't necessarily justify the means. Process is just as important as outcomes (and maybe more) because if we only focus on outcomes we may end up using flawed work methods or processes and still get the end result we desire. In my aviation example, what if my actions (which seemed correct at the time) had resulted in an accident? Would investigators, succumbing to hindsight bias, have felt that we should have made a different decision? If we don’t examine if our processes or work methods are flawed or if they have deficiencies “baked into the recipe” we may never know if the seeds of failure are planted in the process and we may experience failure the next time we try to execute with those processes. Does the phrase, “Sometimes it is better to be lucky than good” come to mind?
5. Acknowledge the need for adaptability and adaptive capacity. It is often because of a team’s ability to anticipate and respond to problems that organizations achieve success. For example, even for small releases that may not impact a database, an organization may still have a Database Administrator on a teleconference because there may still be a risk that something could happen in the late night hours, mid-way through the release, that might impact the database and would require the DBA’s involvement. If the organization doesn’t plan to have the DBA on the call in advance, the DBA as a critical resource could be asleep or otherwise occupied and unavailable or not easily recalled. If we simply try to create a plan and force people to stick to that plan when the operational and working environment conditions clearly indicate adaptation is required, we will likely set ourselves up for failure. As Eisenhower once said, “Plans are useless, planning is everything.” While plans may not actually be useless, the value is really in planning because it elevates the individual and collective awareness of the organization’s and team’s objectives, resources, timelines, and activities. Then, when the operational environment throws a curveball during execution, the teams know how to adapt smartly and safely.
6. Break down the authority gradient between ranks or positions for open communications to speed up execution and foster a bias for action. I am not advocating that everyone simply be allowed to make their own decisions willy-nilly, but I am advocating that organizations must learn to empower those on the “front lines” in DevOps teams to make decisions based on their functional and/or technical expertise. When people are overly intimidated by a senior team member’s rank and/or experience this can stifle information sharing and decision-making. It took us years in Marine Corps aviation to solve this challenge, as we are very hierarchical and the aircraft commander is the one in charge. That being said, in my aviation example above, even some of the more junior crewmembers called the Go-Around. If I were to have shut them down because of my rank or position power the consequences could have been much worse. DevOps teams can create team methods that break down these barriers to effective communication, decision-making and learning. I do, however, think it is important to create processes for sharing information across the team and with those who have the ultimate responsibility to “answer the mail” when things go wrong. Additionally, if there is a critical decision that could have dire business consequences if the team gets it wrong organizations should consider having a “risk hotline” so teams know whom to call to get help with the decision-making process.
7. Build a shared understanding of each team member’s work, so that team members can understand the immediate impact of decisions and cascading impacts across the team. While I was a pilot, I tried to understand what the Crew Chiefs’ and Loadmasters’ jobs entailed. In fact, because our actions were so tightly coupled, there was little room for error, so I had to understand what they were doing. For example, if we were conducting aerial delivery of cargo (where we launch the cargo out the back of the plane with parachutes) and I pulled the nose of the aircraft up too early, I could have caused injury to personnel. So, our checklists were designed to build awareness of each other’s tasks, but we also had to have knowledge about these tasks, beyond simple perfunctory adherence to checklists. This collective mindfulness becomes important in building highly coordinated teamwork, where each member can anticipate what is required to happen in the future, with individual actions and system performance. This helps build higher levels of situational awareness, and I feel this gets to some of the key goals of creating DevOps teams. By creating a collective awareness, these teams can harness the power of both developers and operations teams for improved continuous delivery. Like in Marine Corps aviation, this could help improve situational awareness, reduce risk, and achieve higher levels of precision.
While this is just a short list of guidelines, I hope you find them useful. These were some issues we discovered in Marine Corps aviation, and several I realized later on as a student, and later an instructor, in a Master of Engineering in Advanced Safety Engineering and Management curriculum. Sure, there is room for improvement in the military aviation community and in industry. These guidelines will not solve all of an organization’s DevOps challenges, but they may go a long way in helping DevOps teams and the organization as a whole build a more open, honest, trust-based environment to improve collaboration during software development and deployment. The key is not simply to read these guidelines and understand them, but to inculcate the guidelines into daily habits, which are practiced until they become second nature. Try them out, see what works, and commit to them. Then improve them over time. They should become a “way of life” in the organization. When team members feel “this is the way we do things around here” you know you are on your way to cultural transformation. Then you are on your way to improved resilience and organizational performance. This helped us improve team performance and reduce risk in USMC Aviation and I think it can help DevOps teams.
P.S. If you liked this article, please send me a note using our contact page. Let me know what you liked, didn’t like or what you would like to see in a future article. I am considering a future article with specific examples DevOps teams could use in each of the 7 guiding principles in this post. If you are interested please let me know. Also, I would greatly appreciate it if you would share this using the share buttons on the left of the page, or simply forward the link. This Fall I am working on some of my PhD work and will be investigating sustained adaptability. I hope to report my progress and observations later this Fall.
1. For a description of how workers “finish the design” see Dekker, Sidney. The Field Guide to Understanding Human Error. 3rd ed. Burlington: Ashgate Publishing Company, 2014. Portions of this section were originally in the following post: http://www.safetydifferently.com/flaps-coffee-cups-and-nvgs-a-tale-of-two-safeties/
2. For a detailed explanation of the ETTO Principle see Hollnagel, Erik. The ETTO Principle: Efficiency-Thoroughness Trade-Off: Why Things That Go Right Sometimes Go Wrong. Farnham, England: Ashgate, 2009. Print.
For more information on balancing risk and organizational performance, see Cadieux, Randy E. Team Leadership in High-Hazard Environments: Performance, Safety and Risk Management Strategies for Operational Teams. Burlington: Gower Publishing Company, 2014.
About the Author: Randy Cadieux is the Founder of V-Speed, LLC and the Product Manager of the Crew Resource Management PRO team performance system. He routinely works to educate and train organizations on improving team and organizational resilience, and operations performance.