Playbook

Provider outage operational response

Webhook gaps, API unavailability, and reconciliation freeze patterns when infrastructure signals degrade.

01

Objective

Maintain operational control when provider API or webhook delivery degrades—using freeze patterns and customer communication without inventing uptime guarantees.

02

Prerequisites

  • Health checks on webhook recency and API error rates.
  • Reconciliation freeze procedure documented.
  • Status communication templates that avoid unverified promises.

03

Operational signals

  • Webhook recency lag beyond internal threshold.
  • API read failures for payment status.
  • Growing stuck Pending/Paid populations.

04

Decision points

  • Enable reconciliation freeze for auto-posting.
  • Pause high-risk fulfillment.
  • Switch to manual status checks if available.

05

Escalation paths

  • Operations → provider support with correlation ids.
  • Customer support → operations for ticket surge.

06

Failure modes

  • Assuming outage equals payment failure without evidence.
  • Disabling verification to accept unauthenticated callbacks.

07

Recovery patterns

  1. Backfill provider plane from event log after recovery.
  2. Re-run matchers for outage window.
  3. Clear freeze with finance sign-off.
  • Retries are normal. Webhook delivery is at-least-once. Design consumers to tolerate duplicates and out-of-order arrivals where possible.
  • Asynchronous by design. Payers, chains, and your servers operate on different clocks. UI and finance should not assume synchronous finality.
  • Eventual consistency. API reads, webhooks, and portal views may briefly diverge during transitions. Reconciliation jobs exist to converge truth.

Walkthroughs: /operations