Playbook
Payment incident triage playbook
Classify degrading signals into incident classes and route each to the correct playbook and reference, without skipping evidence or inventing severity levels.
01
Objective
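Classify an operational degradation into an incident class from its signals, collect the minimum evidence, and route to the correct playbook and reference, without applying a fix that belongs to a different plane (for example, opening reconciliation close for a webhook verification problem).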
02
Prerequisites
- Operational signal catalog with internally defined thresholds.
- Incident taxonomy agreed across integration, operations, finance, and treasury.
- Correlation identifiers (payment_id, merchant_id, time window) available.
- Access to the playbooks index and integration references.
03
Operational signals
- One or more catalog signals outside their internal thresholds (webhook recency, checkpoint lag, queue depth, drift, provider latency, payout backlog).
- Customer or support tickets clustering around stuck payments.
- Deploy, secret rotation, or rail change preceding degradation.
- Multiple signals degrading together (often provider + webhook + checkpoint lag).
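As a minimal sketch of the first bullet, the check "which catalog signals are outside their threshold" can be written as a lookup against the signal catalog. The signal names, units, and limits below are hypothetical placeholders; real values come from your own internally defined catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalThreshold:
    name: str
    limit: float  # hypothetical internal threshold; units depend on the signal


# Hypothetical catalog entries. Real names and limits live in your signal catalog.
CATALOG = {
    "webhook_recency_s": SignalThreshold("webhook_recency_s", 300.0),
    "checkpoint_lag_s": SignalThreshold("checkpoint_lag_s", 120.0),
    "queue_depth": SignalThreshold("queue_depth", 1000.0),
    "provider_latency_ms": SignalThreshold("provider_latency_ms", 2000.0),
}


def degrading_signals(readings: dict[str, float]) -> list[str]:
    """Return the catalog signals whose reading exceeds its threshold, sorted by name."""
    breached = []
    for name, value in readings.items():
        threshold = CATALOG.get(name)
        if threshold is None:
            continue  # not in the catalog: skip rather than invent a threshold
        if value > threshold.limit:
            breached.append(name)
    return sorted(breached)
```

Readings that are not in the catalog are ignored on purpose, so an ad-hoc check cannot silently become a triage input.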
04
Decision points
- Is this primarily detection, settlement, webhook, provider, reconciliation, or payout?
- Is the evidence sufficient to open a specialized playbook, or is a freeze/hold safer first?
- Does the incident affect period close, payout execution, or customer fulfillment?
- Should provider outage response precede exception triage?
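One way to make the first and last decision points concrete is a small function that maps degrading signals to candidate classes. This is only a sketch: the signal-to-class mapping is an assumption made for illustration (your taxonomy defines the real one), and the rule that a provider class takes precedence when several signals degrade together is one possible answer to the last question, not a stated policy.

```python
from typing import Optional

# Hypothetical mapping from catalog signal to candidate incident class.
# The class names mirror the decision point above; the pairing is an assumption.
SIGNAL_TO_CLASS = {
    "webhook_recency": "webhook",
    "checkpoint_lag": "detection",
    "queue_depth": "settlement",
    "drift": "reconciliation",
    "provider_latency": "provider",
    "payout_backlog": "payout",
}


def candidate_classes(breached: list[str]) -> list[str]:
    """Map breached signals to distinct classes, listing 'provider' first if present."""
    classes: list[str] = []
    for signal in breached:
        incident_class = SIGNAL_TO_CLASS.get(signal)
        if incident_class is not None and incident_class not in classes:
            classes.append(incident_class)
    # Stable sort: 'provider' moves to the front, other classes keep their order.
    classes.sort(key=lambda c: c != "provider")
    return classes


def triage(breached: list[str]) -> Optional[str]:
    """Return a class name, 'unclear' for joint triage, or None if there is no evidence."""
    classes = candidate_classes(breached)
    if not classes:
        return None  # nothing breached: collect evidence before classifying
    if len(classes) == 1 or classes[0] == "provider":
        return classes[0]
    return "unclear"  # several non-provider classes: hand to joint triage
```

For example, `triage(["provider_latency", "webhook_recency", "checkpoint_lag"])` returns `"provider"`, while `triage(["webhook_recency", "drift"])` returns `"unclear"`.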
05
Escalation paths
- Unclear class → payment operations lead facilitates joint triage.
- Material treasury or payout risk → finance controller before irreversible action.
- Sustained provider degradation → provider support with correlation ids.
- Security-sensitive webhook verification spike → security + integration engineering.
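The four paths above can be sketched as a routing function. The `Incident` fields are hypothetical flags introduced for illustration; the targets are the roles named in the list, and how each flag gets set is left to your own triage process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Incident:
    incident_class: Optional[str]  # None means the class is still unclear
    treasury_or_payout_risk: bool = False
    sustained_provider_degradation: bool = False
    webhook_verification_spike: bool = False


def escalation_targets(incident: Incident) -> list[str]:
    """Return who to involve, in the order the escalation paths are listed."""
    targets: list[str] = []
    if incident.incident_class is None:
        targets.append("payment operations lead")
    if incident.treasury_or_payout_risk:
        # Must happen before any irreversible action is taken.
        targets.append("finance controller")
    if incident.sustained_provider_degradation:
        targets.append("provider support")
    if incident.webhook_verification_spike:
        targets.extend(["security", "integration engineering"])
    return targets
```

An incident can match more than one path, so the function returns a list rather than a single owner.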
06
Failure modes
- Opening reconciliation close for what is purely a webhook verification mistake introduced by a deploy.
- Confirming payments manually to clear checkpoint lag without an audit trail.
- Treating provider outage as payment failure and reversing commerce state.
- Skipping the signal catalog and jumping to ad-hoc explorer checks.
07
Recovery patterns
- Record time window, affected rails, and degrading signals with owners.
- Assign an incident class using the taxonomy on /incidents.
- Open the primary playbook for that class; link secondary playbooks if needed.
- Attach the supporting reference (state model, delivery expectations, signal catalog).
- Schedule an internal post-incident review of matcher or observability gaps.
- Retries are normal. Webhook delivery is at-least-once. Design consumers to tolerate duplicates and out-of-order arrivals where possible.
- Asynchronous by design. Payers, chains, and your servers operate on different clocks. UI and finance should not assume synchronous finality.
- Eventual consistency. API reads, webhooks, and portal views may briefly diverge during transitions. Reconciliation jobs exist to converge truth.
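Because webhook delivery is at-least-once and may arrive out of order, a consumer can deduplicate by event id and refuse to move a payment backwards in its state model. The sketch below assumes each event carries a unique `event_id` and that states have a total order; the state names and the in-memory store are hypothetical stand-ins for your state model reference and a durable store.

```python
from dataclasses import dataclass, field

# Hypothetical state ordering; the real order comes from your state model reference.
STATE_RANK = {"pending": 0, "detected": 1, "confirmed": 2}


@dataclass
class PaymentStore:
    """In-memory stand-in for a durable store of seen events and payment state."""

    seen_event_ids: set = field(default_factory=set)
    state: dict = field(default_factory=dict)

    def apply(self, event_id: str, payment_id: str, new_state: str) -> str:
        """Apply one webhook event; return 'applied', 'duplicate', or 'stale'."""
        if event_id in self.seen_event_ids:
            return "duplicate"  # redelivery of an event already processed
        self.seen_event_ids.add(event_id)
        current = self.state.get(payment_id)
        if current is not None and STATE_RANK[new_state] <= STATE_RANK[current]:
            return "stale"  # arrived late; never move a payment backwards
        self.state[payment_id] = new_state
        return "applied"
```

A short usage example, with a late `detected` event arriving after `confirmed`:

```python
store = PaymentStore()
store.apply("evt-2", "pay-1", "confirmed")  # applied
store.apply("evt-1", "pay-1", "detected")   # stale, state stays "confirmed"
store.apply("evt-2", "pay-1", "confirmed")  # duplicate
```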
Walkthroughs: /operations