Incident Post Mortem: Toward a Learning Organization

Noverino Rifai
Jan 6, 2022

Let's start this topic with a couple of sayings:

“Nobody is perfect”

“No system is perfect”

When a system is disrupted, the result is unexpected behavior, malfunction, or even an outage (partial or full system downtime). We call this kind of disruption an incident.

After receiving an incident report, we usually take the following actions:

  1. Determine the level of criticality based on business impact. This decides whether immediate action is required or whether the incident can be queued and handled later.
  2. Team up: assign a team and a person in charge to deal with the incident. Some organizations already have a preassigned team for operational issues, while others need to form one based on the context of the issue.
  3. Define a communication channel to speed up investigation and resolution. This may be email (for low-criticality issues), group chat, a conference call, or even putting the relevant teams together in one room (for highly critical issues).
  4. Collect facts from the initial reporters and by observing system behavior. Combine these facts into a general pattern that helps identify similar issues in the future.
  5. Brainstorm all potential root causes. List every possibility as a reference for further investigation. Some candidates may be ruled out after investigation, but they are still worth mentioning at this point.
  6. Pick one root cause from the list and investigate it. If it is confirmed, find a solution and define corrective actions. New potential root causes may come up during this exercise; add them to the list.
  7. Execute all required corrective actions.
  8. If the issue still occurs, go back to step #5.
  9. If the system is back to normal, inform stakeholders.
  10. Close the case.
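Steps 5 to 9 form a loop, which can be sketched in code. This is only an illustration under assumptions: the function name `resolve_incident` and the three callbacks (`investigate`, `fix`, `is_healthy`) are hypothetical stand-ins for work that people do by hand during a real incident.

```python
from collections import deque
from typing import Callable


def resolve_incident(
    candidates: list[str],
    investigate: Callable[[str], tuple[bool, list[str]]],
    fix: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> dict:
    """Work through candidate root causes until the system recovers.

    investigate(cause) is assumed to return (confirmed, new_candidates).
    """
    queue = deque(candidates)      # step 5: the brainstormed list
    seen = set(candidates)
    confirmed, ruled_out = [], []
    resolved = False

    while queue and not resolved:  # step 8: loop while the issue persists
        cause = queue.popleft()    # step 6: pick one candidate
        is_cause, new_candidates = investigate(cause)
        for new in new_candidates:  # step 6: add newly found candidates
            if new not in seen:
                seen.add(new)
                queue.append(new)
        if not is_cause:
            ruled_out.append(cause)  # keep it: still useful for the report
            continue
        confirmed.append(cause)
        fix(cause)                 # step 7: execute corrective actions
        resolved = is_healthy()    # step 9: is the system back to normal?

    return {"confirmed": confirmed, "ruled_out": ruled_out, "resolved": resolved}
```

The function returns the ruled-out candidates alongside the confirmed ones, because both lists are worth keeping once the case is closed.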

From a problem-solving point of view, the work stops at the last step. But what happens if the same issue recurs? If it is handled by the same people, they may already know what to do. If it has been a while since the last occurrence, however, they may need to recall what was done previously. Some steps may be skipped unintentionally, or extra steps may be introduced, so the issue takes longer to resolve.

What if the issue is handled by different people? They need to reinvent the wheel and go through steps #4 to #6 again. The time needed will be more or less the same as for the first occurrence. This violates the DRY principle.

DRY — “don’t repeat yourself”

A violation of this principle is called WET: “write every time”, “write everything twice”, “waste everyone’s time” [1]

A learning organization always tries to learn from anything, including incidents. One way to achieve this is to write a Post Mortem Report for each incident.

A Post Mortem Report usually contains the following information:

  1. Summary of the incident, for example “On Friday 7 January 2022, users could not create orders from 03:00 am to 04:00 am”
  2. Initial root causes, including those later ruled out
  3. Corrective actions taken to solve the issue, for example “restart service X”, “archive table T”
  4. Confirmed root causes, only those that led to the incident
  5. Actions timeline: timestamped actions from the incident report to the last corrective action
  6. Lessons learned: what lessons and insights were gained, for example “lack of infrastructure monitoring makes it difficult to prevent this issue”
  7. Next actions: what is needed to improve and to avoid the same incident in the future. This may include a preventive mechanism that detects the conditions leading to the incident, for example “implement infrastructure monitoring”, “optimize flow to reduce database utilization”
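The seven items above map naturally onto a small data structure. The following is a minimal sketch, assuming the report is kept as structured data rather than free text. The class and field names (`PostMortemReport`, `TimelineEntry`, and so on) are hypothetical, and the root causes and timestamp in the usage example are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime   # when the action happened
    action: str    # what was done


@dataclass
class PostMortemReport:
    summary: str                      # item 1
    initial_root_causes: list[str]    # item 2, including ruled-out ones
    corrective_actions: list[str]     # item 3
    confirmed_root_causes: list[str]  # item 4
    timeline: list[TimelineEntry] = field(default_factory=list)  # item 5
    lessons_learned: list[str] = field(default_factory=list)     # item 6
    next_actions: list[str] = field(default_factory=list)        # item 7

    def ruled_out_causes(self) -> list[str]:
        """Initial candidates that turned out not to cause the incident."""
        return [
            cause
            for cause in self.initial_root_causes
            if cause not in self.confirmed_root_causes
        ]


# Usage: the root causes and timestamp below are hypothetical examples.
report = PostMortemReport(
    summary="On Friday 7 January 2022, users could not create orders "
            "from 03:00 am to 04:00 am",
    initial_root_causes=["network outage", "table T too large"],
    corrective_actions=["restart service X", "archive table T"],
    confirmed_root_causes=["table T too large"],
    timeline=[TimelineEntry(datetime(2022, 1, 7, 3, 5), "incident reported")],
    lessons_learned=["lack of infrastructure monitoring makes it difficult "
                     "to prevent this issue"],
    next_actions=["implement infrastructure monitoring"],
)
```

Keeping the initial and confirmed root causes as separate lists means the ruled-out candidates can be derived instead of being recorded twice.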

Note that the purpose of a post mortem report is NOT to blame someone. On the contrary, it should be blameless. Focus on the lessons learned and the next improvements.

The Learning Organization — Post Mortem Document as Organization Knowledge

By practicing Post Mortem Reports, an organization gains knowledge from each incident, learns from it, and keeps improving iteratively.
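One way this knowledge pays off is when an issue recurs and the responder can look up earlier reports instead of starting from scratch. Here is a rough sketch, assuming past reports are stored as dictionaries with a `summary` key. The function name `find_similar` and the naive word-overlap scoring are illustrative assumptions, not a recommended search method.

```python
def find_similar(reports: list[dict], symptoms: str, limit: int = 3) -> list[dict]:
    """Return past reports whose summary shares words with the new symptoms."""
    words = set(symptoms.lower().split())
    scored = []
    for report in reports:
        overlap = words & set(report["summary"].lower().split())
        if overlap:
            scored.append((len(overlap), report))
    # Highest overlap first; the stable sort keeps the original order on ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [report for _, report in scored[:limit]]


# Usage with two hypothetical past reports.
past_reports = [
    {"summary": "users cannot create orders",
     "corrective_actions": ["restart service X", "archive table T"]},
    {"summary": "login page slow",
     "corrective_actions": ["scale out web tier"]},
]
matches = find_similar(past_reports, "Orders cannot be created")
```

Even a crude lookup like this lets a different team start from the recorded corrective actions rather than repeating the investigation.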

~ cheers

References:

[1] https://en.wikipedia.org/wiki/Don%27t_repeat_yourself, accessed 7 January 2022 03:31
