Tracking 22 components across 5 regions · last incident closed 11 days ago
Worker fleet scaled up too slowly during a sudden burst of webhook deliveries. Added headroom in autoscaler configuration. No data loss.
One shard failed health check after a node restart. Replica promoted automatically; full re-balance completed in 13 minutes.
Misconfigured connection pool after a routine deploy. Rolled back via canary; root-cause review filed.