Avochato incident

Client Latency

Major, Resolved

Avochato experienced a major incident on October 28, 2020 affecting avochato.com, lasting 6h 48m. The incident has been resolved; the full update timeline is below.

Started
Oct 28, 2020, 04:20 PM UTC
Resolved
Oct 28, 2020, 11:09 PM UTC
Duration
6h 48m
Detected by Pingoru
Oct 28, 2020, 04:20 PM UTC

Affected components

avochato.com

Update timeline

  1. investigating Oct 28, 2020, 04:20 PM UTC

    We are currently investigating this issue.

  2. identified Oct 28, 2020, 05:02 PM UTC

    The issue has been identified and a fix is being implemented.

  3. identified Oct 28, 2020, 06:17 PM UTC

    Our team has taken steps to mitigate platform latency which has improved but not resolved performance. We are continuing to monitor performance.

  4. monitoring Oct 28, 2020, 06:25 PM UTC

    A fix has been implemented and we are monitoring the results.

  5. resolved Oct 28, 2020, 11:09 PM UTC

    This incident has been resolved.

  6. postmortem Oct 29, 2020, 10:34 PM UTC

    ## What happened

    A large spike in network requests, combined with a backlog of automated usage, led the Avochato platform to queue HTTP requests for longer than usual. The callbacks generated by the spike created a large backlog of work for our servers, which caused page load times to spike and delayed the processing of outgoing messages.

    Subsequently, the load balancer for our platform ran out of available connections for HTTP requests as websocket escalations piled up while users refreshed their browsers during the period of degraded performance. This created a self-reinforcing feedback loop: requests took longer to process and connections to live updates took longer to establish, which in turn kept live updates for inboxes and conversations intermittent and caused HTTP requests to be dropped.

    ## Action items

    Specific bottlenecks in our platform infrastructure’s ability to broker websockets have been identified and addressed. Additional updates to our asynchronous architecture are being planned and prioritized to prevent a similar incident in the future.
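The reconnect pile-up described in the postmortem is commonly dampened on the client side with exponential backoff plus jitter, so that many clients do not retry their websocket connections at the same moment. The sketch below is illustrative only and is not Avochato's actual fix: the function names `backoff_bound` and `reconnect_delay`, the 1-second base, and the 60-second cap are all assumptions chosen for the example.

```python
import random


def backoff_bound(attempt, base=1.0, cap=60.0):
    """Upper bound (seconds) on the wait before reconnect attempt `attempt` (0-based).

    The bound doubles with each failed attempt and is clamped to `cap`.
    `base` and `cap` are hypothetical defaults, not values from the incident.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    return min(cap, base * (2 ** exponent))


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Pick a randomized wait using "full jitter".

    The delay is drawn uniformly between 0 and the capped exponential bound,
    which spreads reconnects from many clients over time instead of letting
    them arrive at the load balancer in synchronized waves.
    """
    return rng.uniform(0.0, backoff_bound(attempt, base, cap))


if __name__ == "__main__":
    # Show the bound and one sampled delay for the first few attempts.
    for attempt in range(8):
        bound = backoff_bound(attempt)
        delay = reconnect_delay(attempt)
        print(f"attempt {attempt}: wait {delay:.2f}s (bound {bound:.0f}s)")
```

A client would call `reconnect_delay(n)` after its n-th consecutive failed connection and reset the counter to zero once a connection succeeds. Manual browser refreshes bypass this kind of client logic, so server-side connection limits would still be needed alongside it.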