
Outages > Incidents > Resilience

  • Writer: Myles Henaghan
  • Aug 22, 2023

🚨You're coming into work stressed about (yet another) production incident. People have questions, but the answers focus on yesterday's drama rather than the pattern of declining reliability. 🥛Glass half-full: at least your alerting, on-call and incident management processes are working well. Sound familiar? Well done, you're passing stage 1 of 3 on software reliability. Time for stage 2.


  • Stage 1: 🔦Lights on, get organised 

  • Stage 2: 🎄Managing Reliability

  • Stage 3: 🩺Paying attention



Stage 1: 🔦Lights on, get organised 🔔


Find out that a system has failed before the customer does. Heavy focus on observability, tooling, and process (checks, alerts, on-call process, incident management).


Stage 2: 🎄Managing Reliability 


Dashboards lit up like a Christmas tree. Alert fatigue. Begin managing reliability targets alongside delivery, quality and security. Focus on incremental improvement through service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs).
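As a rough illustration of managing a reliability target, here is a minimal sketch of an availability SLI checked against an SLO, with the remaining error budget. The function name `slo_report`, the request counts and the 99.9% target are all hypothetical, chosen only for the example. A real setup would pull these numbers from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    allowed_failures: float  # failures the SLO tolerates in this window
    budget_remaining: float  # fraction of the error budget still unspent


def slo_report(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a measured availability SLI against an SLO target.

    All inputs are assumptions for illustration: `slo_target` is a fraction
    such as 0.999, and the counts cover one reporting window.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")

    sli = good_requests / total_requests
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    # Negative values mean the budget is overspent and the SLO is breached.
    budget_remaining = 1 - actual_failures / allowed_failures
    return SloReport(sli, allowed_failures, budget_remaining)


if __name__ == "__main__":
    report = slo_report(good_requests=99_950, total_requests=100_000, slo_target=0.999)
    print(f"SLI: {report.sli:.4%}, budget remaining: {report.budget_remaining:.0%}")
```

Tracking the remaining budget, rather than the raw failure count, is one way to weigh reliability work against delivery work in the same conversation.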


Stage 3: 🩺Paying micro attention ⚠️


Past performance is not an indicator of future performance. A 99.999% uptime service can and will still fail. Signals of tomorrow's outage are available today. Focus on daily log hygiene, anomalies, learning from other teams' incidents, and game days for preparation.
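One way to pay this kind of attention is to look for days where error volume jumps relative to the recent past. The sketch below is an assumption-laden example, not a recommended detector: the function name `flag_anomalies`, the 7-day window and the 3-standard-deviation threshold are all hypothetical. It takes a list of daily error counts and flags days that sit well above the trailing baseline.

```python
from statistics import mean, stdev


def flag_anomalies(daily_error_counts, window=7, threshold=3.0):
    """Return indices of days whose error count is unusually high.

    A day is flagged when its count exceeds the mean of the preceding
    `window` days by more than `threshold` standard deviations.
    The window and threshold defaults are illustrative assumptions.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")

    flagged = []
    for i in range(window, len(daily_error_counts)):
        baseline = daily_error_counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        today = daily_error_counts[i]
        if sigma == 0:
            # A perfectly flat baseline: treat any increase as worth a look.
            if today > mu:
                flagged.append(i)
        elif (today - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up counts: a quiet week and a bit, then a spike on day index 8.
    counts = [10, 12, 11, 9, 10, 13, 11, 12, 40, 11]
    print(flag_anomalies(counts))
```

A flagged day is not an outage. It is a prompt to read the logs before the pattern becomes one.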



Originally published on LinkedIn.

 
 
 


©2024 by Wires Uncrossed Engineering
