Introduction to Resilience Engineering
- Michelle Casey
- Jul 29
- 11 min read
Updated: Aug 4

Introduction
This blog is adapted from my guest appearance on Stephen Townshend’s Slight Reliability podcast, edited for clarity and flow. We explored the differences between reliability, robustness, and resilience, unpacked concepts like complex systems, Safety I vs Safety II, and mental models, and discussed perspectives relating to human error. You can listen on Spotify: Intro to Resilience Engineering with Michelle Casey (Episode 101), or by searching for "Slight Reliability" wherever you listen to podcasts.
What is Resilience Engineering?
Resilience Engineering is:
An interdisciplinary scientific field
A community of researchers and practitioners with diverse backgrounds working in many domains
Resilience Engineering (RE) emerged from safety science and human factors, and came to software from higher consequence industries such as nuclear power, aviation, and healthcare. It draws a lot on systems thinking, provides an alternative perspective to the idea of human error, and focuses on cognitive work and human performance in the context of complex systems.
RE offers a science-backed approach to enable organisations to manage inherent complexity and uncertainty, improving their ability to adapt and respond in the face of failures, incidents and the unexpected. Ultimately, the people who design, build, and operate complex software systems are the key to resilience.
Reliability, Robustness and Resilience
At this point we need to align around some definitions. These are specifically the scientific definitions and so may differ slightly from how you usually hear these terms used in a technology or software context. I must credit Tim Nicholas for these particular definitions; I know he was heavily influenced by the work of Dr David Woods, in particular Four concepts for resilience and the implications for the future of resilience engineering.
We’ll also be talking about hazards and challenges throughout this section. In this context, a hazard or challenge is any event or condition that has the potential to strain or disrupt the system, pushing it towards failure. This can include latent conditions such as misconfigurations or technical debt, unexpected variability such as load spikes, external disruptions like third-party outages, and cognitive and coordination challenges such as unclear ownership or information overload. David Woods frames challenges as “escalating pressures or demands” on systems. A hazard or challenge may not necessarily lead to an incident.
So, onto those definitions.
Reliability should be considered an outcome rather than a state. We can only determine whether a system has been reliable in retrospect, that is, after we have faced a hazard or challenge. Things that have happened in the past define the parameters we use to assess the reliability of our systems; however, to improve reliability we need to do work that increases robustness and resilience.
Robustness is the ability of the system to continue to operate as intended in the face of known hazards or challenges. If we build our system in a way that means it can cope with a particular challenge, we can say the system is robust to that challenge. To emphasise, these are known hazards and challenges, known failure modes - we have seen these failures in the past or we think we will encounter this type of hazard or challenge in the future.
Resilience is the capacity of the system to adapt to unanticipated hazards or challenges, in other words our adaptive capacity. These are things we haven’t faced before that we didn’t think could or would happen, the emergent challenges, and unknown unknowns. Resilience helps us respond to these unanticipated or unknown challenges in the future, reducing the impact where the system was not perfectly robust.
Resilience is the capacity of the system to adapt to unanticipated hazards or challenges, in other words our adaptive capacity.
We can’t predict every possible challenge or failure mode that a system will face, therefore a system can never be perfectly robust or perfectly reliable. However, resilience can help us understand what forms of robustness to implement, which in turn contributes to improving our reliability.
Examples: Reliability, Robustness and Resilience
We have a team that looks after an internet facing web application, maybe it’s multi-tenant SaaS or maybe it’s online quote and buy for insurance. This internet facing web application experiences a huge and sudden load spike. The team has experienced this before and has previously implemented things like dynamic and responsive scaling and rate limiting. On this occasion the system was able to handle the huge load spike, therefore we can say it is robust to this hazard, and in hindsight the system was reliable.
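To make the robustness side of that example a little more concrete, here is a minimal sketch of one of those known-hazard defences, rate limiting, written in Python. It is purely illustrative: the token-bucket approach, the TokenBucket class, and its capacity and refill_rate parameters are my own hypothetical choices rather than anything from the team in the example, but it shows the shape of a protection designed for a failure mode we already know about (a sudden load spike).

import time

class TokenBucket:
    """A token-bucket rate limiter: a defence against a known hazard (sudden load spikes)."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # maximum burst size we are prepared to absorb
        self.refill_rate = refill_rate  # requests allowed per second on average
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        """Return True if the request fits within the configured rate, otherwise shed it."""
        now = time.monotonic()
        # Top up tokens based on the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the known limit: reject (e.g. HTTP 429) rather than fall over

# Example: absorb bursts of up to 100 requests, refilling at 50 requests per second.
limiter = TokenBucket(capacity=100, refill_rate=50)
if limiter.allow_request():
    print("handle the request")
else:
    print("return 429 Too Many Requests")

The point is not the specific mechanism; it is that this protection only exists because the load-spike hazard was known in advance, which is exactly what distinguishes robustness from resilience.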
One day this internet facing web application experiences a third party failure which blocks their responsive and dynamic scaling. This type of failure is unforeseen; it’s not something the team thought ever could or would happen. However, because the team has a deep understanding of the system gained through past incidents, they’re able to work together to figure out a workaround, implement it, and enable their system to recover before their vendor resolves the issue. This is an example of resilience and adaptive capacity.
There are two key details for resilience: the challenge being experienced must be unforeseen, a situation not yet imagined, and the adaptive capacity relied upon must come from capabilities that existed before the event.
At this point I need to highlight that resilience is not a synonym for any of these things:
Redundancy
Robustness
High availability
Fault tolerance
Chaos engineering
Anything about software or hardware
Resilience is something the socio-technical system does, not what it has.
Complex Systems
When trying to understand Resilience Engineering, a shift in thinking is required. For me, having the words to articulate how I think about complex systems was key to creating that shift in thinking. Safety I and Safety II, which we will explore in a later section, are another way to wrap your head around it.
Software systems are complex systems, but what does that actually mean? My preferred definition of complex systems must be credited to Laura Nolan, from her talk at LFI Conf 2023, Systems Thinking Methods for Incident Analysis, and I have added some of my own elaboration.
A complex system has:
Multiple components - often many hundreds of components in a software system
Non-linear interactions and feedback loops - a user input in one part of the system can result in an output in another part of the system
Interactions with the environment - the internet, users, our vendors, malicious actors and so on
Dynamic and constantly changing - not just our own software and infrastructure, but the many vendors, integrations and dependencies they rely on are constantly changing too
State and history - our systems remember past states and user interactions
Problems often arise in the interactions not the components
If this is the definition of the systems we are dealing with, they are much less neatly defined than some would like to believe, and therefore much more uncertain and much less controllable.
From my perspective, one of the most important aspects when thinking about complex systems is that they are dynamic and constantly changing, and it’s the engineers who are keeping them running. Fred Hebert from Honeycomb summed it up really nicely in this quote:
“People have a mental model where the system is stable until disturbed, far more often than they have one where the system is balanced because it is constantly intervened with. The latter is a more useful approach to thinking about complex systems.”
In thinking about complex systems, we also can’t go past How Complex Systems Fail by Dr Richard Cook. This is generally considered essential reading for anyone interested in incidents and system failures. I strongly recommend you read the article, as I will be unable to do it justice here.
Safety I and Safety II
Erik Hollnagel introduced the idea of Safety I and Safety II as a way to highlight and contrast what makes RE different. A Tale of Two Safeties and From Safety-I to Safety-II: A White Paper are two key resources here.
Safety I is about reducing failure: making sure as few things as possible go wrong. There is a focus on human error, and humans are considered a liability or hazard. Accidents (or incidents) are caused by failures/malfunctions, and learning comes predominantly from things that go wrong.
Safety II is about increasing success: making sure as many things as possible go right. The focus is on human performance - the system succeeds because of humans and their expertise, their ability to monitor and observe the dynamic complex system, and to learn, adapt and respond as needed. When it comes to an accident or incident, we want to understand how things usually go right as the basis for understanding how things occasionally go wrong - what were the conditions for variability in system performance?
Resilience: It's not you, it's the system provides a nice, accessible explanation of this in the context of the New Zealand health system.
Human Error
In RE, there is acknowledgement that ‘human error’ can’t really be considered a cause or an acceptable conclusion; rather, it is an indicator of underlying issues in the wider system that need to be further investigated. Human performance is considered to be systematically shaped by an engineer's tools, technology, goals and operating environment. It’s not the human in isolation; you’ve got to look at the whole system. This thinking around human error precedes RE, but it is related.
‘Human error’ can’t really be considered a cause or an acceptable conclusion; rather, it is an indicator of underlying issues in the wider system that need to be further investigated.
Let's explore that further with the help of Behind Human Error by Dr David Woods, Sidney Dekker, Dr Richard Cook and Leila Johannesen, and The Field Guide to Understanding Human Error by Sidney Dekker.
I think everyone who is involved in incidents has probably heard something like “Why didn’t the responder do X”, or “Why didn’t the engineer know to do Y”. This sort of thing is a great example of a Human Error view of things, and it’s also an example of hindsight bias and counterfactuals.
Hindsight bias changes how we look at past decisions and actions. It takes something that was very complex and difficult at the time and simplifies it into something very linear and uncomplicated. When people know the outcome, they overestimate their ability to predict and prevent the outcome.
When people know the outcome, they overestimate their ability to predict and prevent the outcome.
A counterfactual is counter to the facts - if they had done this, things would have been different. It’s a reality that didn’t happen, and that doesn’t help us understand the incident.
The ‘tunnel’ analogy is often referenced here, from Sidney Dekker’s book.

When it comes to incidents, we want to understand how people’s actions made sense at the time, given their goals, context and knowledge and what they were paying attention to. This is known as the local rationality principle. We need to understand the perspective from inside the tunnel when the outcome was not yet known, but we need to pair this with the wider organisational context. Simply saying “Why didn’t the engineer do that, they should have done this” ignores really important details about the incident, prevents us from learning and reduces psychological safety. These sorts of questions also don’t acknowledge a lot of what we know about complex systems and system failure that we’ve covered so far in this blog.
To get the true value from our incident reviews, we need to discard counterfactual thinking and hindsight bias, and try to put ourselves in the shoes of responders. We need to put ourselves inside the tunnel.
Mental Models
We talk a lot about mental models in RE, but what actually is a mental model?
A mental model is an internal representation or framework that an individual (or engineer) uses to understand and interact with the complex system, helping them make sense of information, interpret situations, and predict outcomes.

The concepts and visuals from Dr Richard Cook’s Above the Line, Below the Line and The STELLA Report by the SNAFUcatchers Consortium are useful when it comes to understanding mental models. Below the line is ‘the system’; however, engineers never actually touch or see the system itself; they interact with representations of it. That’s what’s above the line: those representations and interfaces.
As we’ve covered previously, our systems are constantly changing over time. It was Dr Gary Klein who said something like ‘mental models are stories, not pictures’. They have a time dimension. When we think about our mental model of a system, it’s useful to think of it as something that continuously evolves, rather than something static.
If we put all of this together: an engineer understands ‘the system’ through their conceptual understanding and their own mental model, using those representations and interfaces to interact with the system, while both the system and their mental model constantly change and evolve. Above the line is where that happens. Observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting - otherwise known as cognitive work. For a more complete explanation I would recommend watching John Allspaw’s talk, How Your Systems Keep Running Day After Day.
I’m sure many of us have experienced incidents that were really surprising or confusing and just didn’t make sense. The way the failure played out and the behaviour of the system did not align with our existing mental model and the different ways we thought the system could fail.
These sorts of incidents can be a good reference point for thinking about the importance of mental models in RE. We have this complex system that is constantly changing, and our mental models also need to adapt and change in line with this. Due to the scale, complexity and dynamic nature of the system, it is not possible for an individual person or engineer to have a ‘complete and correct’ mental model. What we need is for our engineers to have different and overlapping mental models of the system that can be combined together when it comes to incident response.
Reframing Incidents for Learning and Resilience
In order to learn from incidents, proponents of RE undertake incident analysis. This is done by drawing on techniques such as naturalistic decision making and cognitive task analysis, using this ‘new view’ of human error, and stepping away from hindsight bias and counterfactuals.
This is no small task, and not something that can be taught in a blog post (or podcast episode). However, you can start to reframe how you think about incidents.
Colette Alexander broke this down really nicely in a recent DORA Community Discussion, which also included a ‘speedrun’ of Safety Science’s greatest hits - remembering that RE is rooted in Safety Science.

In the context of RE you will often hear people speak of contributing factors. While this can be a good first step for reframing how you think about incidents, there is a bit more to it than just that. When it comes to an incident, we want to understand how things usually go right as the basis for understanding how things occasionally go wrong. What adaptations and interventions are happening around us to keep our systems running?
By producing detailed and coherent incident narratives and presenting these in a way that encourages insightful and productive discussions, we can facilitate incident learning. Engineers can hear the story of what happened and observe how others detected, diagnosed and responded. They are able to reflect and internalise lessons without having been present in the incident, in turn updating and growing their mental model of the dynamic and constantly changing complex system.
For the avoidance of doubt, learning from incidents is not these things:
Learning what actions to take
Learning how to make sure an incident doesn't happen again
Learning how to fix humans, rather than systems
Through learning from incidents, we want to better enable engineers and system operators to compare and grow their mental models for these massively complex and constantly changing systems. We want to enable better decision making for engineering leaders, and ultimately improve both robustness and resilience of our systems and our organisations.
In the words of John Allspaw,
“The richest understanding of the event, for the broadest possible audience.”
References
Dr David Woods, (2015), Four concepts for resilience and the implications for the future of resilience engineering
Laura Nolan, (2023), Systems Thinking Methods for Incident Analysis
Dr Richard Cook, (1998), How Complex Systems Fail
Erik Hollnagel, (2013), A Tale of Two Safeties
Erik Hollnagel, (2013), From Safety-I to Safety-II: A White Paper
Dr Carl Horsley, (2018), Resilience: It's not you, it's the system
Dr David Woods, Sidney Dekker, Dr Richard Cook and Leila Johannesen, (2010), Behind Human Error
Sidney Dekker, (2014), The Field Guide to Understanding Human Error
SNAFUcatchers Consortium, (2017), STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity (The STELLA Report)
Dr Richard Cook, (2019), Above the Line, Below the Line
John Allspaw, (2017), How Your Systems Keep Running Day After Day
Colette Alexander, (2025), DORA Community Discussion - Resilience Engineering
Additional resources for getting started with Resilience Engineering
Nora Jones, Laura Maguire and Vanessa Huerta Granda, (2021), Howie: The Post-Incident Guide
Lorin Hochstein, (2024), Resilience Engineering: Where do I start?