Avochato incident

Application Latency

Avochato experienced a minor incident on February 12, 2021, affecting avochato.com, the API, and Mobile, lasting 1h 32m. The incident has been resolved; the full update timeline is below.

Started: Feb 12, 2021, 07:43 PM UTC
Resolved: Feb 12, 2021, 09:16 PM UTC
Duration: 1h 32m
Detected by Pingoru: Feb 12, 2021, 07:43 PM UTC

Affected components

avochato.com, API, Mobile

Update timeline

investigating Feb 12, 2021, 07:43 PM UTC

We are investigating slower than average inbox load times and inbox live updates.
identified Feb 12, 2021, 07:53 PM UTC

Our team has identified the issue and is working to speed up the queued messages and events.
monitoring Feb 12, 2021, 07:54 PM UTC

Average load times have returned to normal and we are continuing to monitor cloud infrastructure performance.
monitoring Feb 12, 2021, 08:00 PM UTC

We are still seeing some higher than average load times for specific inboxes and are continuing to investigate.
monitoring Feb 12, 2021, 08:10 PM UTC

We are still seeing elevated load times and are applying diagnostics to reduce the impact on application performance.
monitoring Feb 12, 2021, 08:21 PM UTC

We are observing average load times returning to normalcy. We will continue monitoring cloud infrastructure performance.
resolved Feb 12, 2021, 09:16 PM UTC

This incident has been resolved. We will continue monitoring throughout the day.
postmortem Feb 12, 2021, 10:02 PM UTC

## What Happened

Platform automation handling live notifications for messages led to excessive queueing of updates to critical tables in our write database, which in turn led to longer turnaround times for live updates inside the inbox.

After identifying the root cause, Engineering deployed a fix (which succeeded in stemming the root cause), but in the meantime we over-scaled our workers to adjust to their load, which caused connection issues for non-workers. This led to an escalation in page load times, as our congested databases could no longer serve our applications in a timely manner.

Our cloud operations automatically scaled to handle the increased pressure from the root cause while we resolved the issue, but over-scaled improperly, leading to increased load times when accessing the app. Load times additionally spiked as we swapped the databases, then returned to normal.

During this period, one specific database appeared to fail despite having no usage. This led to a second period of increased load times after a brief reprieve. This database was safely replaced, servers were routed to the replacement, and load times returned to normal.

## Impact

Initial delays in receiving live inbox updates, followed by high page load times when viewing the app.

Delays in updating conversations and contacts.

Due to delays in receiving live updates, some messages that appeared to be double-sent manually were delivered exactly once, as intended.

## Resolution

The team has already deployed updates to prevent the root cause from occurring.

We have made adjustments to prevent the second issue, including adjusting the maximum connections to the write database and safely replacing the faulty database that appeared to fail over during this period.

Our engineering team will audit the configurations that led to the lack of cloud resources when auto-scaling.
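
The over-scaling described above, where scaled-up workers consumed the database connections that non-worker processes needed, can be illustrated with a rough sketch. This is not Avochato's actual configuration or tooling: the function names, the per-worker pool size, and the numbers used are all hypothetical, chosen only to show how an auto-scaling target might be capped by a connection budget.

```python
# Illustrative sketch only: all names and numbers are hypothetical and are
# not taken from Avochato's infrastructure.


def max_safe_workers(db_max_connections: int,
                     reserved_for_app: int,
                     pool_size_per_worker: int) -> int:
    """Return the largest worker count whose combined connection pools
    still leave `reserved_for_app` connections free for non-workers."""
    if pool_size_per_worker <= 0:
        raise ValueError("pool_size_per_worker must be positive")
    available = db_max_connections - reserved_for_app
    # Never return a negative count if the reservation exceeds the limit.
    return max(available // pool_size_per_worker, 0)


def clamp_scale_target(requested_workers: int,
                       db_max_connections: int,
                       reserved_for_app: int,
                       pool_size_per_worker: int) -> int:
    """Cap an auto-scaler's requested worker count at the connection budget."""
    ceiling = max_safe_workers(db_max_connections,
                               reserved_for_app,
                               pool_size_per_worker)
    return min(requested_workers, ceiling)
```

Under these assumed numbers, a write database allowing 500 connections with 100 reserved for web and API processes and 5 connections per worker would cap scaling at 80 workers, so a request for 200 workers would be clamped to 80 rather than starving the application of connections.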