§ 01 · POSTMORTEM TEMPLATE
The postmortem we actually want to reread.
Most postmortems fail because they are written for an audience of leadership who will read them once. The ones worth writing are read three times: once at publication, once during the next incident that looks like this one, and once a year later by the engineer who inherited the system.
# Incident <short slug> · <date> · Severity <1–4>

## Summary (two sentences, read-aloud test)
## Customer impact (who, how many, for how long, in what currency)
## Timeline (UTC, one line per event, no commentary)
## What went wrong (mechanism, not blame)
## What went right (kept impact from being worse)
## Where we got lucky (tell the truth here)
## Detection (who/what noticed, how long after onset)
## Mitigation (what actually stopped the bleeding)
## Contributing factors (ordered, not exhaustive)
## Action items (owner · due date · size · reversibility)
## Glossary of terms used (for the engineer in 18 months)
Rules: no names in the timeline. "Where we got lucky" is not optional. Action items have owners or they are not action items.
§ 02 · ADR TEMPLATE
Architecture decision record.
One file, numbered, in /docs/adr/. Immutable once merged; superseded by a later ADR, never edited. The point is not to be right; the point is to leave behind why we thought we were right at the time.
# ADR-NNNN: <short title in sentence case>

Status: proposed | accepted | superseded by ADR-NNNN
Date: YYYY-MM-DD
Deciders: <github handles>

## Context
What forces are in play. What we tried first and why it didn't hold.

## Decision
The thing we are doing. Present tense. One paragraph.

## Alternatives considered
Two or three, with the reason each was rejected. Include the alternative you secretly prefer but can't justify.

## Consequences
What becomes easier. What becomes harder. What the reversal plan is. What we will measure to know this was right.
§ 03 · ON-CALL, FIRST 15 MINUTES
The first fifteen minutes of a page.
A pocket card we leave on every on-call engineer's desk, or the digital equivalent pinned in their #oncall channel. It is deliberately short. A panicking human cannot read more than this.
- Acknowledge within two minutes. Even if you have no idea what's wrong. The page owns you until you own it.
- Open the three dashboards. Traffic. Errors. Saturation. In that order. Named in the pager message.
- Declare. If customer-facing, declare an incident before investigating further. Declaration is cheap; under-declaration is not.
- Find the edge. When did the metric change? What deployed, flagged, or was migrated within twenty minutes of that edge?
- Stop the bleeding first, diagnose second. Rollback, flag flip, rate limit, or drain before you go reading code.
- Write as you go. Paste into the incident channel every thirty seconds. Your future postmortem author is you.
- Ask for a second pair of eyes at minute ten. Not at minute forty.
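The "find the edge" step above is mechanical enough to sketch. This is a minimal illustration, not a real tool: the event log, names, and timestamps are all hypothetical, and in practice the data comes from your deploy pipeline, flag service, or audit log.

```python
from datetime import datetime, timedelta

# Hypothetical change log: (timestamp, kind, description).
EVENTS = [
    (datetime(2024, 5, 1, 14, 2), "deploy", "api v2024.05.01-1"),
    (datetime(2024, 5, 1, 14, 11), "flag", "enable_new_cache to 50%"),
    (datetime(2024, 5, 1, 9, 30), "migration", "orders index rebuild"),
]

def suspects(edge: datetime, window_minutes: int = 20):
    """Changes that landed in the window before the metric's edge."""
    lo = edge - timedelta(minutes=window_minutes)
    return sorted(e for e in EVENTS if lo <= e[0] <= edge)

# The error-rate graph changed at 14:15; what moved just before it?
for ts, kind, desc in suspects(datetime(2024, 5, 1, 14, 15)):
    print(f"{ts:%H:%M} {kind}: {desc}")
```

Twenty minutes is the starting window, not a law; widen it if nothing turns up, because the edge you spotted may be a delayed symptom.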
§ 04 · SLO MENU
A starter menu of SLOs that survive first contact with a PM.
The trap: a team sets an SLO, then discovers it burns half its error budget in a normal week, because the SLI is measuring the wrong thing. These are the five we most often propose as starting points, each with the wrong version of itself noted.
| SLI | Objective | Window | Footgun |
|---|---|---|---|
| API availability | 99.9% of requests return non-5xx (4xx excluded from the denominator) | 30 days rolling | Counting 429s as failures burns budget on well-behaved rate limits. |
| API p99 latency | < 400 ms | 30 days rolling | Measured at the load balancer, not in the client; you'll miss DNS and TLS. |
| Async job freshness | 95% of jobs complete within 60s of enqueue | 7 days rolling | "Completed" that counts retries as success hides a DLQ problem. |
| Webhook delivery | 99.5% delivered within 5 minutes | 30 days rolling | Stopping the clock at "first attempt" instead of "acknowledged". |
| Deploy lead time | Median PR merge → prod < 30 minutes | 7 days rolling | Excluding failed deploys from the median; those are the ones that matter. |
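Before adopting a row from this menu, it helps to translate the objective into budget minutes, so everyone can see what a normal bad week costs. A minimal sketch of the arithmetic, assuming total unavailability (every request failing) while the budget is being spent:

```python
def error_budget_minutes(objective: float, window_days: int) -> float:
    """Minutes of total unavailability the objective tolerates per window."""
    return (1 - objective) * window_days * 24 * 60

# 99.9% over 30 days is about 43 minutes; 99.5% over 30 days is 216.
for objective in (0.999, 0.995):
    print(f"{objective:.3%} / 30d -> {error_budget_minutes(objective, 30):.1f} min")
```

Partial outages spend the budget proportionally: an hour at 10% failure costs the same as six minutes of total failure.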
§ 05 · GLOSSARY
A small glossary, because not every reader has been in the room.
- Blast radius: The set of customers, services, or data that a given failure can reach. Measured in nouns, not percent.
- DLQ: Dead-letter queue. Where jobs go after N failures so they can be inspected without blocking live traffic. Only useful if someone is on the hook for draining it.
- Decorrelated jitter: Retry backoff that chooses a random delay from an expanding range, so retrying clients don't resynchronize and hammer the recovering service in lockstep.
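The definition is easier to see in code. A minimal sketch of one common formulation, where the next delay is drawn between a base and three times the previous delay; the base and cap values here are illustrative, not recommendations:

```python
import random

def next_delay(prev: float, base: float = 0.1, cap: float = 20.0) -> float:
    # Draw from [base, prev * 3]: the range expands with each retry,
    # while the randomness keeps independent clients out of lockstep.
    return min(cap, random.uniform(base, prev * 3))

delay = 0.1  # start at the base delay
for attempt in range(5):
    delay = next_delay(delay)  # sleep(delay) would go here
```

Contrast with plain exponential backoff, where every client that failed at the same instant retries at the same instant, forever.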
- Error budget: The amount of unreliability permitted by an SLO in a given window. Spent well when used to take calculated risks; spent poorly when leadership treats hitting zero as a personal failing.
- Shadow read / dual write: Two techniques for de-risking a data migration: during shadow read, the new system answers alongside the old one and disagreements are logged; during dual write, both stores receive every mutation until the new one is trusted. Together, they are how cutovers happen without a maintenance window.
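The two techniques fit naturally in one wrapper around the stores. A minimal sketch, assuming a hypothetical store interface with `put` and `get`; the class and logger names are illustrative:

```python
import logging

log = logging.getLogger("migration")

class MigratingStore:
    """Hypothetical wrapper while the old store is still the source of truth."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def put(self, key, value):
        # Dual write: both stores receive every mutation.
        self.old.put(key, value)
        self.new.put(key, value)

    def get(self, key):
        # Shadow read: the old store answers; the new store's answer is only
        # compared, and disagreements are logged, never surfaced to callers.
        answer = self.old.get(key)
        shadow = self.new.get(key)
        if shadow != answer:
            log.warning("mismatch for %r: old=%r new=%r", key, answer, shadow)
        return answer
```

Cutover is then a one-line change: `get` starts returning `shadow` once the mismatch log has been quiet long enough to trust.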
- SLO: Service-level objective. A written commitment about a measurable property of the service, with a window and a target.
- Toil: Operational work that scales linearly with service size and produces no lasting value. The correct amount is small and decreasing.