Courier incident

Message send delays

Minor Resolved

Courier experienced a minor incident on July 14, 2022 affecting the API, lasting 7h 31m. The incident has been resolved; the full update timeline is below.

Started
Jul 14, 2022, 09:34 PM UTC
Resolved
Jul 15, 2022, 05:06 AM UTC
Duration
7h 31m
Detected by Pingoru
Jul 14, 2022, 09:34 PM UTC

Affected components

API

Update timeline

  1. investigating Jul 14, 2022, 09:34 PM UTC

    We are currently investigating an issue that is affecting send times for some messages.

  2. identified Jul 14, 2022, 10:50 PM UTC

    The issue has been identified and a resolution is being deployed to our production services.

  3. identified Jul 15, 2022, 01:26 AM UTC

    We are continuing to work towards resolution of the issue. We are currently seeing delays of approximately 2 hours for some message deliveries.

  4. monitoring Jul 15, 2022, 04:49 AM UTC

    A fix has been implemented and we are monitoring system health. All backlogged messages are being processed.

  5. resolved Jul 15, 2022, 05:06 AM UTC

    The incident has been resolved.

  6. postmortem Jul 19, 2022, 04:53 PM UTC

    ### Impact

    Courier experienced delayed message delivery in its send pipeline, impacting 0.1% of messages from 12:50pm to 9:50pm PT on 7/14. No messages were dropped as a result of the incident. 99.9% of send calls experienced no delivery delay. The average message send delay was 3 hours and 20 minutes for impacted messages.

    #### Root Cause

    Courier uses feature flags to safely roll out new features. Due to a misconfiguration of a flag, a larger than expected volume of send requests was included in a validation experiment meant to verify that a refactor of the send pipeline was safe to roll out. These requests added significant additional load on key stages of the send pipeline and caused requests unrelated to validation to queue.

    #### Remediation

    Courier incrementally scaled up processing capacity in the send pipeline to work through the large accumulated backlog of messages. Additionally, a hotfix release was pushed to production to drop validation messages that had already entered the send pipeline.

    #### Follow up actions

    * Courier has established a process to better validate flag configuration in the future, and has changed its feature flag helper library to make its use less error-prone.
    * Courier has created an incident playbook to guide on-call engineers through options to quickly scale up message processing in the send pipeline.
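
The root cause and the first follow-up action both concern flag configuration. As a rough illustration only, the sketch below shows one way a flag helper could reject an oversized experiment share at load time and assign requests to the experiment deterministically. It is not Courier's helper library: `ExperimentFlag`, `in_experiment`, the field names, and the flag name are all hypothetical, and the cap value is an assumption chosen for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentFlag:
    """Hypothetical flag config; names are illustrative, not Courier's."""

    name: str
    rollout_percent: float      # share of requests to include, 0-100
    max_allowed_percent: float  # safety cap agreed on before rollout

    def __post_init__(self) -> None:
        # Fail at load time so a misconfigured flag never reaches the pipeline.
        for field_name in ("rollout_percent", "max_allowed_percent"):
            value = getattr(self, field_name)
            if not 0 <= value <= 100:
                raise ValueError(f"{self.name}: {field_name} must be within 0-100")
        if self.rollout_percent > self.max_allowed_percent:
            raise ValueError(
                f"{self.name}: rollout_percent {self.rollout_percent} "
                f"exceeds cap {self.max_allowed_percent}"
            )


def in_experiment(flag: ExperimentFlag, request_id: str) -> bool:
    """Deterministically place a request in one of 10,000 buckets."""
    digest = hashlib.sha256(f"{flag.name}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # rollout_percent of 1.0 selects buckets 0-99, i.e. about 1% of requests.
    return bucket < flag.rollout_percent * 100


if __name__ == "__main__":
    flag = ExperimentFlag("send-refactor-validation", rollout_percent=1.0,
                          max_allowed_percent=5.0)
    sampled = sum(in_experiment(flag, f"req-{i}") for i in range(20_000))
    print(f"{sampled} of 20000 requests selected for validation")
```

With a cap like this, a typo such as `rollout_percent=50.0` raises a `ValueError` when the flag is constructed instead of silently routing half of all send requests into the validation path.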