Stream incident

High error rates and timeouts

Minor Resolved

Stream experienced a minor incident on January 28, 2020 that affected us-east and lasted 46 minutes. The incident has been resolved; the full update timeline is below.

Started
Jan 28, 2020, 04:12 PM UTC
Resolved
Jan 28, 2020, 04:58 PM UTC
Duration
46m
Detected by Pingoru
Jan 28, 2020, 04:12 PM UTC

Affected components

us-east

Update timeline

  1. monitoring Jan 28, 2020, 04:49 PM UTC

    A recent release increased load on part of the chat infrastructure, causing degraded performance and timeout errors. Remediation is in progress.

  2. monitoring Jan 28, 2020, 04:56 PM UTC

    We are continuing to monitor for any further issues.

  3. monitoring Jan 28, 2020, 04:57 PM UTC

    We are continuing to monitor for any further issues.

  4. resolved Jan 28, 2020, 04:58 PM UTC

    This incident has been resolved.

  5. postmortem Jan 28, 2020, 04:59 PM UTC

    Between 4:05PM and 4:45PM UTC on January 28, 2020 we had an API outage caused by performance degradation. The event was triggered by a new release to our Chat API servers; shortly after the new release went live, load on our database infrastructure increased, causing HTTP response times to spike and, in some cases, requests to time out. The event was detected by our latency and error monitoring. The team began remediation by rolling back to the previous version at 4:20PM UTC. Unfortunately, the rollback did not fully resolve the problem. After a second rollback attempt we realised that queries issued by the rolled-back release were still pending on our PostgreSQL database. We manually terminated all of these pending queries at 4:40PM UTC, after which the error rate dropped back to 0%. The outage affected 5% of HTTP requests at its peak (4:20PM to 4:27PM UTC).
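
The manual cleanup step described in the postmortem can be illustrated with a minimal Python sketch. This is not the procedure the Stream team actually ran. It assumes a DB-API style connection (for example one from psycopg2) and treats any active query that started before a chosen cutoff, such as the rollback time, as left over from the rolled-back release. The helper names `select_stale_pids` and `terminate_stale` are hypothetical; `pg_stat_activity`, `pg_backend_pid()` and `pg_terminate_backend()` are standard PostgreSQL facilities.

```python
from datetime import datetime, timezone

# Lists active queries that started before a cutoff, excluding our own session.
# pg_stat_activity is a standard PostgreSQL system view.
FIND_STALE_SQL = """
SELECT pid
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < %s
  AND pid <> pg_backend_pid()
"""

# pg_terminate_backend() ends the backend process that owns the query.
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"


def select_stale_pids(rows, cutoff):
    """Pure helper: pick PIDs of active queries that started before `cutoff`.

    `rows` is a list of dicts with 'pid', 'state' and 'query_start' keys
    (an assumed shape, chosen for illustration).
    """
    return [
        row["pid"]
        for row in rows
        if row["state"] == "active" and row["query_start"] < cutoff
    ]


def terminate_stale(conn, cutoff):
    """Terminate every active query older than `cutoff`; return the PIDs hit.

    `conn` is assumed to be a DB-API connection whose cursors work as
    context managers and accept %s placeholders (as psycopg2 does).
    """
    with conn.cursor() as cur:
        cur.execute(FIND_STALE_SQL, (cutoff,))
        pids = [row[0] for row in cur.fetchall()]
        for pid in pids:
            cur.execute(TERMINATE_SQL, (pid,))
    return pids


# Example cutoff: the first rollback time reported in the postmortem.
ROLLBACK_AT = datetime(2020, 1, 28, 16, 20, tzinfo=timezone.utc)
```

In practice it is worth reviewing the rows returned by the first query before terminating anything, since a cutoff based only on start time can also match legitimate long-running work.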