Paul S.

Checkout Observability & Reliability

Metrics, dashboards, alerts, error taxonomy, and production cutover monitoring for checkout systems.

Context

Checkout was spread across several new services and in-flight migrations. We needed observability that matched how the system actually worked, not generic infra dashboards, so we could ship faster without losing safety.

Problem

Existing dashboards mixed signal and noise. Errors were classified inconsistently across services, alerts paged for the wrong things, and cutovers lacked a clear go/no-go view.

What I did

  • Defined a metrics taxonomy and error framework shared across Shop Platform and Checkout services.
  • Built dashboards and alerts tied to user-visible behavior: latency percentiles, error rate by class, partner success, and revenue parity.
  • Set up cutover monitoring views and rollback signals used during migrations.
  • Worked with on-call to tune alerting so pages reflected real incidents, not flaky noise.
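
To make the shared error framework concrete, here is a minimal sketch of the idea, not the production code. The class names, error codes, and paging rule are all hypothetical and chosen for illustration. The point is that every service maps its raw errors onto the same small set of classes, so "error rate by class" means the same thing on every dashboard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class ErrorClass(Enum):
    """Hypothetical shared error classes (the real taxonomy is not shown here)."""
    USER = "user"        # e.g. declined card, invalid input
    PARTNER = "partner"  # an upstream payment or shipping provider failed
    SYSTEM = "system"    # our own bug, timeout, or dependency outage


@dataclass(frozen=True)
class ClassifiedError:
    error_class: ErrorClass
    pages_on_call: bool  # whether this class is allowed to page at all


# Assumed mapping from service-level error codes to the shared classes.
_CODE_TO_CLASS = {
    "card_declined": ErrorClass.USER,
    "invalid_address": ErrorClass.USER,
    "psp_timeout": ErrorClass.PARTNER,
    "db_unavailable": ErrorClass.SYSTEM,
}


def classify(code: str) -> ClassifiedError:
    """Map a raw error code to a shared class; unknown codes count as SYSTEM."""
    error_class = _CODE_TO_CLASS.get(code, ErrorClass.SYSTEM)
    # Assumption for this sketch: user errors never page, everything else may.
    return ClassifiedError(error_class, pages_on_call=error_class is not ErrorClass.USER)


def error_rate_by_class(codes: Iterable[str], total_requests: int) -> Dict[str, float]:
    """Return the error rate per class as a fraction of total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    counts = {c.value: 0 for c in ErrorClass}
    for code in codes:
        counts[classify(code).error_class.value] += 1
    return {name: n / total_requests for name, n in counts.items()}
```

Treating unknown codes as SYSTEM is a deliberate choice in this sketch: an unclassified error stays visible instead of silently falling into a bucket that never alerts.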

Result

  • Faster production debugging and clearer operational signal.
  • Safer cutovers: go/no-go decisions backed by a small set of trusted metrics.
  • Less alert fatigue for on-call engineers.
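
As a rough illustration of what a go/no-go check over a small metric set can look like, the sketch below compares the old and new checkout paths on the signals named above. The field names and every threshold are assumptions made up for this example, and "revenue parity" is interpreted here as revenue on the new path staying within a small relative drift of the old path.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CutoverSnapshot:
    """Metrics for one comparison window (all field names are illustrative)."""
    p99_latency_ms_old: float
    p99_latency_ms_new: float
    system_error_rate_new: float     # fraction of requests, 0.0 to 1.0
    partner_success_rate_new: float  # fraction of partner calls that succeeded
    revenue_old: float
    revenue_new: float


# Assumed thresholds, for illustration only.
MAX_P99_REGRESSION = 1.20     # new p99 may be at most 20% slower than old
MAX_SYSTEM_ERROR_RATE = 0.005
MIN_PARTNER_SUCCESS = 0.98
MAX_REVENUE_DRIFT = 0.01      # relative difference between old and new


def go_no_go(s: CutoverSnapshot) -> Tuple[bool, List[str]]:
    """Return (go, reasons); go is True only when no check fails."""
    reasons: List[str] = []
    if s.p99_latency_ms_new > s.p99_latency_ms_old * MAX_P99_REGRESSION:
        reasons.append("p99 latency regression")
    if s.system_error_rate_new > MAX_SYSTEM_ERROR_RATE:
        reasons.append("system error rate too high")
    if s.partner_success_rate_new < MIN_PARTNER_SUCCESS:
        reasons.append("partner success too low")
    # Skip the parity check when there is no baseline revenue to compare to.
    if s.revenue_old > 0:
        drift = abs(s.revenue_new - s.revenue_old) / s.revenue_old
        if drift > MAX_REVENUE_DRIFT:
            reasons.append("revenue parity drift")
    return (not reasons, reasons)
```

Returning the list of failed checks alongside the boolean keeps the decision explainable: a "no-go" arrives with the reasons a rollback is being considered.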

Tech

  • OpenTelemetry
  • Prometheus
  • Micrometer
  • Datadog
  • PagerDuty
  • Sentry
  • AWS
  • EKS