Observability
Payment systems intelligence (conceptual)
Observability for crypto payment infrastructure is plane-aware: it describes what integration engineers, operators, finance, and treasury each need to see, and what must never be collapsed into a single green indicator. These pages define signals and views, not live dashboards or published metrics.
Full references: Signal catalog · Dashboard model · Incident taxonomy & routing · Incident triage playbook
Operational signal catalog
Six bounded signals that mature teams track internally. Thresholds are merchant-defined; Kobbopay does not publish SLA numbers or live telemetry on this site.
Signal catalog summary — define thresholds internally; this site does not publish live metrics.
| Signal | What it measures | Typical owner | Incident classes |
|---|---|---|---|
| Webhook recency | Time since the last successfully verified webhook was processed for a merchant environment—or per-endpoint if you shard … | Integration engineering / SRE | Webhook, Provider |
| Checkpoint lag | Elapsed time between lifecycle milestones (detection → eligibility → policy confirmation → finance reconciliation). | Payment operations | Settlement, Detection |
| Exception queue depth | Count of open, taxonomy-owned exceptions awaiting resolution—segmented by class and age bucket. | Operations / finance | Reconciliation, Settlement |
| Reconciliation drift | Persistent mismatch between commerce, provider, and finance plane states after matchers run—not one-off timing skew. | Finance reconciliation | Reconciliation |
| Provider latency | Response time and error rate for provider API reads/writes and webhook delivery attempts—observed from your integration … | Integration engineering | Provider, Webhook |
| Payout review backlog | Open payout or withdrawal requests awaiting treasury review, dual control, or ledger eligibility confirmation. | Treasury / finance | Payout, Reconciliation |
Webhook recency
Time since the last successfully verified webhook was processed for a merchant environment, or per endpoint if you shard consumers.
Healthy pattern: Recency stays within thresholds you define per traffic profile; occasional gaps align with known quiet periods.
Investigate when: Recency grows while commerce or provider planes show activity; spikes after deploys or secret rotation.
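A minimal sketch of the recency check, assuming you record a timestamp each time a webhook passes verification. The function names, the 15-minute threshold, and the `upstream_active` flag are illustrative assumptions; they are not part of any Kobbopay API.

```python
from datetime import datetime, timedelta, timezone


def webhook_recency(last_verified_at: datetime, now: datetime) -> timedelta:
    """Time elapsed since the last successfully verified webhook was processed."""
    return now - last_verified_at


def should_investigate(recency: timedelta, threshold: timedelta,
                       upstream_active: bool) -> bool:
    """Flag only when recency exceeds the merchant-defined threshold while the
    commerce or provider plane still shows activity (quiet periods are not flagged)."""
    return recency > threshold and upstream_active


# Hypothetical values for illustration.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_verified = datetime(2024, 1, 1, 11, 40, tzinfo=timezone.utc)
recency = webhook_recency(last_verified, now)
```

Pairing the threshold with an activity flag keeps known quiet periods from paging anyone, which matches the healthy pattern described above.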
Checkpoint lag
Elapsed time between lifecycle milestones (detection → eligibility → policy confirmation → finance reconciliation).
Healthy pattern: Lag distributions match rail and confirmation policy expectations documented internally.
Investigate when: Payments stall between checkpoints; lag grows faster than historical baseline for the same rail.
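One way to compute checkpoint lag, sketched under the assumption that each payment carries a timestamp per milestone it has reached. The milestone keys and the `checkpoint_lags` helper are hypothetical names chosen for this example.

```python
from datetime import datetime, timezone

# Milestone order taken from the lifecycle above; key names are assumptions.
MILESTONES = ["detection", "eligibility", "policy_confirmation", "finance_reconciliation"]


def checkpoint_lags(timestamps: dict) -> dict:
    """Return seconds elapsed between each pair of consecutive reached milestones."""
    lags = {}
    for prev, nxt in zip(MILESTONES, MILESTONES[1:]):
        if prev in timestamps and nxt in timestamps:
            lags[f"{prev}->{nxt}"] = (timestamps[nxt] - timestamps[prev]).total_seconds()
    return lags


# Hypothetical payment that has not yet reached finance reconciliation.
payment_timestamps = {
    "detection": datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    "eligibility": datetime(2024, 1, 1, 12, 2, 0, tzinfo=timezone.utc),
    "policy_confirmation": datetime(2024, 1, 1, 12, 10, 0, tzinfo=timezone.utc),
}
lags = checkpoint_lags(payment_timestamps)
```

Pairs with a missing milestone are omitted rather than reported as zero, so a stalled payment shows up as an absent lag instead of a misleadingly small one.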
Exception queue depth
Count of open, taxonomy-owned exceptions awaiting resolution, segmented by class and age bucket.
Healthy pattern: Depth stable or draining during business hours; new items match known noise patterns.
Investigate when: Depth grows monotonically; aging items exceed internal review targets; single class dominates.
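The segmentation by class and age bucket could look like the following sketch. The bucket boundaries (1 hour, 24 hours) and the record fields `class` and `opened_at` are assumptions for illustration; use whatever review targets you define internally.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumed bucket boundaries; replace with your internal review targets.
AGE_BUCKETS = [(timedelta(hours=1), "<1h"), (timedelta(hours=24), "1h-24h")]


def age_bucket(age: timedelta) -> str:
    """Map an exception's age to the first bucket whose upper limit it is under."""
    for limit, label in AGE_BUCKETS:
        if age < limit:
            return label
    return ">=24h"


def queue_depth(open_exceptions: list, now: datetime) -> Counter:
    """Count open exceptions keyed by (incident class, age bucket)."""
    depth = Counter()
    for exc in open_exceptions:
        depth[(exc["class"], age_bucket(now - exc["opened_at"]))] += 1
    return depth


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
open_exceptions = [
    {"class": "Reconciliation", "opened_at": now - timedelta(minutes=30)},
    {"class": "Reconciliation", "opened_at": now - timedelta(hours=5)},
    {"class": "Settlement", "opened_at": now - timedelta(days=2)},
]
depth = queue_depth(open_exceptions, now)
```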
Reconciliation drift
Persistent mismatch between commerce, provider, and finance plane states after matchers run, as opposed to one-off timing skew.
Healthy pattern: Drift items are rare, classified, and tied to known async windows.
Investigate when: Same payment_id fails matchers repeatedly; drift clusters by rail, merchant, or time window.
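To separate persistent drift from one-off timing skew, one option is to count consecutive failed matcher runs per `payment_id`. This sketch assumes plane states have already been normalized to a shared vocabulary; the `DriftTracker` class, the state strings, and the three-run threshold are hypothetical.

```python
from collections import defaultdict


def planes_match(commerce: str, provider: str, finance: str) -> bool:
    """True when all three planes report the same (normalized) state."""
    return commerce == provider == finance


class DriftTracker:
    """Tracks consecutive matcher failures per payment_id."""

    def __init__(self, min_consecutive_failures: int = 3):
        self.min_consecutive_failures = min_consecutive_failures
        self.failures = defaultdict(int)

    def record_run(self, payment_id: str, commerce: str, provider: str, finance: str) -> None:
        if planes_match(commerce, provider, finance):
            # A successful match resets the streak: earlier mismatches were timing skew.
            self.failures.pop(payment_id, None)
        else:
            self.failures[payment_id] += 1

    def drifting(self) -> list:
        """payment_ids whose mismatch persisted across enough consecutive runs."""
        return sorted(pid for pid, n in self.failures.items()
                      if n >= self.min_consecutive_failures)


tracker = DriftTracker()
for _ in range(3):
    tracker.record_run("pay_1", "paid", "paid", "pending")   # never converges
tracker.record_run("pay_2", "paid", "pending", "pending")    # transient skew
tracker.record_run("pay_2", "paid", "paid", "paid")          # converges on next run
```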
Provider latency
Response time and error rate for provider API reads/writes and webhook delivery attempts, observed from your integration boundary.
Healthy pattern: Latency and error rates within bands you track per environment; retries succeed without handler exhaustion.
Investigate when: Elevated timeouts; read failures block status reconciliation; retry storms correlate with consumer crashes.
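A small sketch of summarizing latency and error rate from samples captured at your integration boundary. The sample shape (`latency_ms`, `ok`) is an assumption, and the percentile uses the simple nearest-rank method rather than any particular monitoring product's definition.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: list) -> dict:
    """Summarize provider calls as p95 latency and error rate."""
    if not samples:
        raise ValueError("no samples in window")
    latencies = [s["latency_ms"] for s in samples]
    errors = sum(1 for s in samples if not s["ok"])
    return {"p95_ms": percentile(latencies, 95), "error_rate": errors / len(samples)}


# Hypothetical window: ten calls, 100 ms to 1000 ms, the two slowest failed.
samples = [{"latency_ms": ms, "ok": ms < 900} for ms in range(100, 1001, 100)]
summary = summarize(samples)
```

Tracking a tail percentile alongside the error rate helps distinguish slow-but-succeeding calls from outright failures, which route to different incident classes.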
Payout review backlog
Open payout or withdrawal requests awaiting treasury review, dual control, or ledger eligibility confirmation.
Healthy pattern: Backlog drains on schedule; holds are policy-driven with documented reasons.
Investigate when: Requests exceed recognized balance checks; backlog grows during unrelated settlement incidents.
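As an illustration, the backlog count and the balance check can be derived from the open request list. The `awaiting_review` status string, the record fields, and the per-request comparison against a single recognized balance are assumptions for this sketch; your ledger model may differ.

```python
from decimal import Decimal


def backlog_report(requests: list, recognized_balance: Decimal) -> dict:
    """Count requests awaiting review and list those larger than the recognized balance."""
    open_requests = [r for r in requests if r["status"] == "awaiting_review"]
    over_balance = [r["id"] for r in open_requests if r["amount"] > recognized_balance]
    return {"backlog": len(open_requests), "exceeds_recognized_balance": over_balance}


# Hypothetical payout requests.
requests = [
    {"id": "po_1", "status": "awaiting_review", "amount": Decimal("250.00")},
    {"id": "po_2", "status": "awaiting_review", "amount": Decimal("1200.00")},
    {"id": "po_3", "status": "approved", "amount": Decimal("5000.00")},
]
report = backlog_report(requests, recognized_balance=Decimal("1000.00"))
```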
Operational dashboard concepts
Role-oriented views prevent single-plane dashboards from hiding reconciliation drift, webhook gaps, or treasury risk. Design internal tooling against these questions—not vanity uptime percentages.
Role-oriented views — design your internal dashboards against these questions.
| View | Primary questions (sample) | Key signals |
|---|---|---|
| Finance view | Which payments are books-ready versus merely detected? | Reconciliation drift; Exception queue depth; Checkpoint lag (finance gates) |
| Integration engineer view | Are webhooks verified, idempotent, and recent? | Webhook recency; Provider latency; Checkpoint lag (detection → Paid) |
| Support / operator view | What lifecycle state should support quote to the customer? | Checkpoint lag; Exception queue depth; Webhook recency (indirect stuck states) |
| Treasury view | Which balances are recognized versus in-flight? | Payout review backlog; Checkpoint lag (recognition → posting); Reconciliation drift (ledger vs provider) |
| Executive health view | Are payment systems degrading by class (webhook, settlement, reconciliation)? | Aggregate signal trends you define internally; Incident class counts (not vanity uptime percentages); Exception queue aging buckets |
Finance view
Primary questions
- Which payments are books-ready versus merely detected?
- Where do matchers fail across commerce, provider, and finance planes?
- What exceptions block period close?
Must not collapse: Provider Confirmed labels into treasury posted without reconciliation evidence.
Integration engineer view
Primary questions
- Are webhooks verified, idempotent, and recent?
- Where do handlers crash or exhaust retries?
- Which rails show elevated provider latency?
Must not collapse: HTTP 200 responses into successful side effects without idempotency persistence.
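The sketch below shows one way to tie an HTTP 200 to a persisted idempotency record. It assumes the side effect can run inside the same database transaction as the idempotency insert; if your side effect lives in another system, you would need a different pattern. The table name, `handle_webhook`, and the returned labels are hypothetical, and SQLite stands in for your real store.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """In-memory store; the PRIMARY KEY enforces one record per event_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    return conn


def handle_webhook(conn: sqlite3.Connection, event_id: str, apply_side_effect) -> tuple:
    """Return (status_code, outcome). 200 is returned only when the idempotency
    record is committed, or when the event was already processed earlier."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            try:
                conn.execute(
                    "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
                )
            except sqlite3.IntegrityError:
                return 200, "duplicate"  # redelivery: acknowledge, skip side effect
            apply_side_effect()
    except Exception:
        # Insert was rolled back, so a provider retry can be processed cleanly.
        return 500, "retry"
    return 200, "processed"


conn = make_store()
fulfilled = []
first = handle_webhook(conn, "evt_1", lambda: fulfilled.append("evt_1"))
second = handle_webhook(conn, "evt_1", lambda: fulfilled.append("evt_1"))
```

Here the second delivery of `evt_1` is acknowledged without repeating the side effect, and a failing side effect yields a non-200 so the provider retries.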
Support / operator view
Primary questions
- What lifecycle state should support quote to the customer?
- Which exceptions are owned and within review?
- Is fulfillment allowed under merchant policy?
Must not collapse: Explorer screenshots or chat overrides into authoritative lifecycle truth.
Treasury view
Primary questions
- Which balances are recognized versus in-flight?
- What payout requests await dual control?
- Are settlement and payout rails aligned?
Must not collapse: Detected inbound funds into payout eligibility without recognition gates.
Executive health view
Primary questions
- Are payment systems degrading by class (webhook, settlement, reconciliation)?
- Where are open incidents concentrated?
- Is period close at risk from exception or drift trends?
Must not collapse: Multiple incident classes into a single uptime percentage without taxonomy.
From signals to action
When a signal degrades, classify the incident, then open the matching playbook. Start with payment incident triage when the class is unclear.
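The signal-to-class mapping in the catalog table can be encoded directly. The voting heuristic below is an assumption for illustration, not a documented routing rule: each degraded signal votes for its incident classes, and a tie falls back to payment incident triage.

```python
from collections import Counter

# Incident classes per signal, copied from the signal catalog table.
SIGNAL_CLASSES = {
    "Webhook recency": ["Webhook", "Provider"],
    "Checkpoint lag": ["Settlement", "Detection"],
    "Exception queue depth": ["Reconciliation", "Settlement"],
    "Reconciliation drift": ["Reconciliation"],
    "Provider latency": ["Provider", "Webhook"],
    "Payout review backlog": ["Payout", "Reconciliation"],
}


def candidate_classes(degraded_signals: list) -> Counter:
    """Tally how many degraded signals point at each incident class."""
    votes = Counter()
    for signal in degraded_signals:
        votes.update(SIGNAL_CLASSES[signal])
    return votes


def next_step(degraded_signals: list) -> str:
    """Pick a playbook when one class clearly leads; otherwise start with triage."""
    votes = candidate_classes(degraded_signals)
    if not votes:
        return "no action"
    ranked = votes.most_common()
    top_class, top_votes = ranked[0]
    tied = [cls for cls, count in ranked if count == top_votes]
    if len(tied) > 1:
        return "payment incident triage"  # class is unclear
    return f"{top_class} playbook"
```

For example, `next_step(["Webhook recency"])` falls back to triage because Webhook and Provider tie, while adding a second reconciliation-related signal to `Reconciliation drift` keeps the result on the Reconciliation playbook.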