Uploadcare incident

Webhook Service Degradation

Major Resolved

Uploadcare experienced a major incident on September 25, 2024, affecting Webhooks and lasting 2h 11m. The incident has been resolved; the full update timeline is below.

Started
Sep 25, 2024, 10:07 AM UTC
Resolved
Sep 25, 2024, 12:19 PM UTC
Duration
2h 11m
Detected by Pingoru
Sep 25, 2024, 10:07 AM UTC

Affected components

Webhooks

Update timeline

  1. investigating Sep 25, 2024, 12:28 PM UTC

    We're experiencing a slowdown in our Webhooks service.

  2. resolved Sep 25, 2024, 12:29 PM UTC

    This incident has been resolved. We apologize for any inconvenience this may have caused.

  3. postmortem Sep 26, 2024, 08:55 AM UTC

    ## Incident Summary

    On 25 September 2024, an issue with webhook delivery was identified, affecting clients between 10:07 and 12:19 UTC. The delay impacted webhook notifications, with no data loss but a significant delay in processing and delivery.

    ## Timeline

    * 10:07 UTC – A system configuration change was made, which inadvertently disrupted webhook processing.
    * 10:07 UTC – Webhook delivery issues began.
    * 12:05 UTC – The problem was identified and resolved, with backlogged webhooks being processed.
    * 12:12 UTC – The first webhook was successfully delivered after the fix.
    * 12:19 UTC – All queued events were processed, with delivery confirmed for all affected users.

    ## Root Cause

    The issue was caused by a configuration change that resulted in the webhook delivery system not processing events correctly. Despite initial signs of system health, the disruption went undetected due to gaps in the system’s monitoring tools.

    ## Impact

    * Webhook delivery was delayed for approximately 2 hours.
    * Customers experienced delays in receiving event notifications.
    * No data was lost, but delivery delays were significant due to a backlog in event processing.

    ## Challenges During Resolution

    * Monitoring systems indicated that components of the webhook system were healthy, which delayed identification of the underlying problem.

    ## Resolution

    * Webhook processing was restarted, and we verified that all queued events were delivered without any data loss.
    * The incident was fully resolved by 12:19 UTC, with all webhooks processed and delivered.

    ## Action Items

    ### Short-term

    * Improve the system’s monitoring and alerting to better detect issues with webhook processing.

    ### Long-term

    * Explore options to improve the resilience of our webhook delivery system, including scaling the infrastructure to better handle failures.
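
The monitoring gap described in the postmortem (components reporting healthy while no webhooks were delivered) is often addressed by alerting on the age of the oldest undelivered event rather than on process liveness alone. The sketch below illustrates that general idea only. It is not Uploadcare's implementation: `QueuedEvent`, `oldest_event_age`, `delivery_is_stalled`, and the 300-second threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class QueuedEvent:
    """Hypothetical record for a webhook event waiting to be delivered."""
    event_id: str
    enqueued_at: float  # Unix timestamp (seconds) when the event entered the queue


def oldest_event_age(queue: Sequence[QueuedEvent], now: float) -> Optional[float]:
    """Return the age in seconds of the oldest undelivered event, or None if the queue is empty."""
    if not queue:
        return None
    return now - min(event.enqueued_at for event in queue)


def delivery_is_stalled(
    queue: Sequence[QueuedEvent],
    now: float,
    max_age_seconds: float = 300.0,  # assumed threshold, not taken from the postmortem
) -> bool:
    """Flag a stall when the oldest queued event has waited longer than the threshold."""
    age = oldest_event_age(queue, now)
    return age is not None and age > max_age_seconds
```

A check of this kind depends on whether events are actually leaving the queue, so it can fire even when every worker process passes its own health check, which is the situation the postmortem describes.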