Introduction to Resilience Engineering
- Michelle Casey
- Jul 29
- 11 min read
Updated: Aug 4

Introduction
This blog is adapted from my guest appearance on Stephen Townshend’s Slight Reliability podcast, edited for clarity and flow. We explored the differences between reliability, robustness, and resilience, unpacked concepts like complex systems, Safety I vs Safety II, and mental models, and discussed perspectives relating to human error. You can listen on Spotify: Intro to Resilience Engineering with Michelle Casey (Episode 101), or by searching for "Slight Reliability" wherever you listen to podcasts.
What is Resilience Engineering?
Resilience Engineering is:
An interdisciplinary scientific field
A community of researchers and practitioners with diverse backgrounds working in many domains
Resilience Engineering (RE) emerged from safety science and human factors, and came to software from higher consequence industries such as nuclear power, aviation, and healthcare. It draws a lot on systems thinking, provides an alternative perspective to the idea of human error, and focuses on cognitive work and human performance in the context of complex systems.
RE offers a science-backed approach to enable organisations to manage inherent complexity and uncertainty, improving their ability to adapt and respond in the face of failures, incidents and the unexpected. Ultimately, the people who design, build, and operate complex software systems are the key to resilience.
Reliability, Robustness and Resilience
At this point we need to align around some definitions. These are specifically the scientific definitions and so may differ slightly from how you usually hear these terms used in a technology or software context. I must credit Tim Nicholas for these particular definitions; I know he was heavily influenced by the work of Dr David Woods, in particular Four concepts for resilience and the implications for the future of resilience engineering.
We’ll also be talking about hazards and challenges throughout this section. In this context, a hazard or challenge is any event or condition that has the potential to strain or disrupt the system, pushing it towards failure. This can include latent conditions such as misconfigurations or technical debt, unexpected variability such as load spikes, external disruptions like third-party outages, and cognitive and coordination challenges such as unclear ownership or information overload. David Woods frames challenges as “escalating pressures or demands” on systems. A hazard or challenge may not necessarily lead to an incident.
So, onto those definitions.
Reliability should be considered an outcome rather than a state. We can only determine whether a system has been reliable in retrospect, that is, after we have faced a hazard or challenge. Things that have happened in the past define the parameters we use to assess the reliability of our systems; however, to improve reliability we need to do work that increases robustness and resilience.
Robustness is the ability of the system to continue to operate as intended in the face of known hazards or challenges. If we build our system in a way that means it can cope with a particular challenge, we can say the system is robust to that challenge. To emphasise, these are known hazards and challenges, known failure modes - we have seen these failures in the past or we think we will encounter this type of hazard or challenge in the future.
Resilience is the capacity of the system to adapt to unanticipated hazards or challenges, in other words our adaptive capacity. These are things we haven’t faced before that we didn’t think could or would happen, the emergent challenges, and unknown unknowns. Resilience helps us respond to these unanticipated or unknown challenges in the future, reducing the impact where the system was not perfectly robust.
Resilience is the capacity of the system to adapt to unanticipated hazards or challenges, in other words our adaptive capacity.
We can’t predict every possible challenge or failure mode that a system will face, therefore a system can never be perfectly robust or perfectly reliable. However, resilience can help us understand what forms of robustness to implement, which in turn contributes to improving our reliability.
Examples: Reliability, Robustness and Resilience
We have a team that looks after an internet facing web application, maybe it’s multi-tenant SaaS or maybe it’s online quote and buy for insurance. This internet facing web application experiences a huge and sudden load spike. The team has experienced this before and has previously implemented things like dynamic and responsive scaling and rate limiting. On this occasion the system was able to handle the huge load spike, therefore we can say it is robust to this hazard, and in hindsight the system was reliable.
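To make the robustness side of that example a little more concrete, here is a minimal sketch of one of those known-hazard defences, rate limiting, written in Python. It is purely illustrative: the token-bucket approach, the TokenBucket class, and its capacity and refill_rate parameters are my own hypothetical choices rather than anything from the team in the example, but it shows the shape of a protection designed for a failure mode we already know about (a sudden load spike).

import time

class TokenBucket:
    """A token-bucket rate limiter: a defence against a known hazard (sudden load spikes)."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # maximum burst size we are prepared to absorb
        self.refill_rate = refill_rate  # requests allowed per second on average
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        """Return True if the request fits within the configured rate, otherwise shed it."""
        now = time.monotonic()
        # Top up tokens based on the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the known limit: reject (e.g. HTTP 429) rather than fall over

# Example: absorb bursts of up to 100 requests, refilling at 50 requests per second.
limiter = TokenBucket(capacity=100, refill_rate=50)
if limiter.allow_request():
    print("handle the request")
else:
    print("return 429 Too Many Requests")

The point is not the specific mechanism; it is that this protection only exists because the load-spike hazard was known in advance, which is exactly what distinguishes robustness from resilience.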
One day this internet facing web application experiences a third party failure which blocks their responsive and dynamic scaling. This type of failure is unforeseen; it’s not something the team thought ever could or would happen. However, because the team has a deep understanding of the system gained through past incidents, they’re able to work together to figure out a workaround, implement it, and enable their system to recover before their vendor resolves the issue. This is an example of resilience and adaptive capacity.
There are two key details for resilience: the challenge being experienced must be unforeseen, a situation not yet imagined, and the adaptive capacity relied upon must come from capabilities that existed before the event.
At this point I need to highlight that resilience is not a synonym for any of these things:
Redundancy
Robustness
High availability
Fault tolerance
Chaos engineering
Anything about software or hardware
Resilience is something the socio-technical system does, not what it has.
Complex Systems
When trying to understand Resilience Engineering, a shift in thinking is required. For me, having the words to articulate how I think about complex systems was key to creating that shift in thinking. Safety I and Safety II, which we will explore in a later section, are another way to wrap your head around it.
Software systems are complex systems, but what does that actually mean? My preferred definition of complex systems must be credited to Laura Nolan, from her talk at LFI Conf 2023, Systems Thinking Methods for Incident Analysis, and I have added some of my own elaboration.
A complex system has:
Multiple components - often many hundreds of components in a software system
Non-linear interactions and feedback loops - a user input in one part of the system can result in an output in another part of the system
Interactions with the environment - the internet, users, our vendors, malicious actors and so on
Dynamic and constantly changing - not just our own software and infrastructure, but the many vendors, integrations and dependencies they rely on are constantly changing too
State and history - our systems remember past states and user interactions
Problems often arise in the interactions not the components
If this is the definition of the systems we are dealing with, they are much less neatly defined than some would like to believe, and therefore much more uncertain and much less controllable.
From my perspective, one of the most important aspects when thinking about complex systems is that they are dynamic and constantly changing, and it’s the engineers who are keeping them running. Fred Hebert from Honeycomb summed it up really nicely in this quote:
“People have a mental model where the system is stable until disturbed, far more often than they have one where the system is balanced because it is constantly intervened with. The latter is a more useful approach to thinking about complex systems.”
In thinking about complex systems, we also can’t go past How Complex Systems Fail by Dr Richard Cook. This is generally considered essential reading for anyone interested in incidents and system failures. I strongly recommend you read the article, as I will be unable to do it justice here.
Safety I and Safety II
Erik Hollnagel introduced the idea of Safety I and Safety II as a way to highlight and contrast what makes RE different. A Tale of Two Safeties and From Safety-I to Safety-II: A White Paper are two key resources here.
Safety I is about reducing failure: making sure as few things as possible go wrong. There is a focus on human error, and humans are considered a liability or hazard. Accidents (or incidents) are caused by failures/malfunctions, and learning comes predominantly from things that go wrong.
Safety II is about increasing success: making sure as many things as possible go right. The focus is on human performance - the system succeeds because of humans and their expertise, their ability to monitor and observe the dynamic complex system, and to learn, adapt and respond as needed. When it comes to an accident or incident, we want to understand how things usually go right as the basis for understanding how things occasionally go wrong - what were the conditions for variability in system performance?
Resilience: It's not you, it's the system provides a nice, accessible explanation of this in the context of the New Zealand health system.
Human Error
In RE, there is acknowledgement that ‘human error’ can’t really be considered a cause or an acceptable conclusion; rather, it is an indicator of underlying issues in the wider system that need to be further investigated. Human performance is considered to be systematically shaped by an engineer's tools, technology, goals and operating environment. It’s not the human in isolation; you’ve got to look at the whole system. This thinking around human error precedes RE, but it is related.
‘Human error’ can’t really be considered a cause or an acceptable conclusion; rather, it is an indicator of underlying issues in the wider system that need to be further investigated.
Let's explore that further with the help of Behind Human Error by Dr David Woods, Sidney Dekker, Dr Richard Cook and Leila Johannesen, and The Field Guide to Understanding Human Error by Sidney Dekker.
I think everyone who is involved in incidents has probably heard something like “Why didn’t the responder do X”, or “Why didn’t the engineer know to do Y”. This sort of thing is a great example of a Human Error view of things, and it’s also an example of hindsight bias and counterfactuals.
Hindsight bias changes how we look at past decisions and actions. It takes something that was very complex and difficult at the time and simplifies it into something very linear and uncomplicated. When people know the outcome, they overestimate their ability to predict and prevent the outcome.
When people know the outcome, they overestimate their ability to predict and prevent the outcome.
A counterfactual is counter to the facts - if they had done this, things would have been different. It’s a reality that didn’t happen, and that doesn’t help us understand the incident.
The ‘tunnel’ analogy is often referenced here, from Sidney Dekker’s book.

When it comes to incidents, we want to understand how people’s actions made sense at the time, given their goals, context and knowledge and what they were paying attention to. This is known as the local rationality principle. We need to understand the perspective from inside the tunnel when the outcome was not yet known, but we need to pair this with the wider organisational context. Simply saying “Why didn’t the engineer do that, they should have done this” ignores really important details about the incident, prevents us from learning and reduces psychological safety. These sorts of questions also don’t acknowledge a lot of what we know about complex systems and system failure that we’ve covered so far in this blog.
To get the true value from our incident reviews, we need to discard counterfactual thinking and hindsight bias, and try to put ourselves in the shoes of responders. We need to put ourselves inside the tunnel.
Mental Models
We talk a lot about mental models in RE, but what actually is a mental model?
A mental model is an internal representation or framework that an individual (or engineer) uses to understand and interact with the complex system, helping them make sense of information, interpret situations, and predict outcomes.

The concepts and visuals from Dr Richard Cook’s Above the Line, Below the Line and The STELLA Report by the SNAFUcatchers Consortium are useful when it comes to understanding mental models. Below the line is ‘the system’; however, engineers never actually touch or see the system itself; they interact with representations of it. That’s what’s above the line: those representations and interfaces.
As we’ve covered previously, our systems are constantly changing over time. It was Dr Gary Klein who said something like ‘mental models are stories, not pictures’. They have a time dimension. When we think about our mental model of a system, it’s useful to think of it as something that continuously evolves, rather than something static.
If we put all of this together: an engineer understands ‘the system’ through their conceptual understanding and their own mental model, using those representations and interfaces to interact with the system, while both the system and their mental model constantly change and evolve. Above the line is where that happens. Observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting - otherwise known as cognitive work. For a more complete explanation I would recommend watching John Allspaw’s talk, How Your Systems Keep Running Day After Day.
I’m sure many of us have experienced incidents that were really surprising or confusing and just didn’t make sense. The way the failure played out and the behaviour of the system did not align with our existing mental model and the different ways we thought the system could fail.
These sorts of incidents can be a good reference point for thinking about the importance of mental models in RE. We have this complex system that is constantly changing, and our mental models also need to adapt and change in line with this. Due to the scale, complexity and dynamic nature of the system, it is not possible for an individual person or engineer to have a ‘complete and correct’ mental model. What we need is for our engineers to have different and overlapping mental models of the system that can be combined together when it comes to incident response.
Reframing Incidents for Learning and Resilience
In order to learn from incidents, proponents of RE undertake incident analysis. This is done by drawing on techniques such as naturalistic decision making and cognitive task analysis, using this ‘new view’ of human error, and stepping away from hindsight bias and counterfactuals.
This is no small task, and not something that can be taught in a blog post (or podcast episode). However, you can start to reframe how you think about incidents.
Colette Alexander broke this down really nicely in a recent DORA Community Discussion, which also included a ‘speedrun’ of Safety Science’s greatest hits - remembering that RE is rooted in Safety Science.

In the context of RE you will often hear people speak of contributing factors. While this can be a good first step for reframing how you think about incidents, there is a bit more to it than just that. When it comes to an incident, we want to understand how things usually go right as the basis for understanding how things occasionally go wrong. What adaptations and interventions are happening around us to keep our systems running?
By producing detailed and coherent incident narratives and presenting these in a way that encourages insightful and productive discussions, we can facilitate incident learning. Engineers can hear the story of what happened and observe how others detected, diagnosed and responded. They are able to reflect and internalise lessons without having been present in the incident, in turn updating and growing their mental model of the dynamic and constantly changing complex system.
For the avoidance of doubt, learning from incidents is not these things:
Learning what actions to take
Learning how to make sure an incident doesn't happen again
Learning how to fix humans, rather than systems
Through learning from incidents, we want to better enable engineers and system operators to compare and grow their mental models for these massively complex and constantly changing systems. We want to enable better decision making for engineering leaders, and ultimately improve both robustness and resilience of our systems and our organisations.
In the words of John Allspaw,
“The richest understanding of the event, for the broadest possible audience.”
References
Dr David Woods, (2015), Four concepts for resilience and the implications for the future of resilience engineering
Laura Nolan, (2023), Systems Thinking Methods for Incident Analysis
Dr Richard Cook, (1998), How Complex Systems Fail
Erik Hollnagel, (2013), A Tale of Two Safeties
Erik Hollnagel, (2013), From Safety-I to Safety-II: A White Paper
Dr Carl Horsley, (2018), Resilience: It's not you, it's the system
Dr David Woods, Sidney Dekker, Dr Richard Cook and Leila Johannesen, (2010), Behind Human Error
Sidney Dekker, (2014), The Field Guide to Understanding Human Error
SNAFUcatchers Consortium, (2017), STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity (The STELLA Report)
Dr Richard Cook, (2019), Above the Line, Below the Line
John Allspaw, (2017), How Your Systems Keep Running Day After Day
Colette Alexander, (2025), DORA Community Discussion - Resilience Engineering
Additional resources for getting started with Resilience Engineering
Nora Jones, Laura Maguire and Vanessa Huerta Granda, (2021), Howie: The Post-Incident Guide
Lorin Hochstein, (2024), Resilience Engineering: Where do I start?