Courier experienced a minor incident on July 14, 2022, that affected the API and lasted 7h 31m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Jul 14, 2022, 09:34 PM UTC
We are currently investigating an issue that is affecting send times for some messages.
- identified Jul 14, 2022, 10:50 PM UTC
The issue has been identified and a resolution is being deployed to our production services.
- identified Jul 15, 2022, 01:26 AM UTC
We are continuing to work toward resolving the issue. We are currently seeing delays of approximately 2 hours in the delivery of some messages.
- monitoring Jul 15, 2022, 04:49 AM UTC
A fix has been implemented and we are monitoring system health. All backlogged messages are being processed.
- resolved Jul 15, 2022, 05:06 AM UTC
The incident has been resolved.
- postmortem Jul 19, 2022, 04:53 PM UTC
### Impact

Courier experienced delayed message delivery in its send pipeline, impacting 0.1% of messages from 12:50pm to 9:50pm PT on 7/14. No messages were dropped as a result of the incident. 99.9% of send calls experienced no delivery delay. The average send delay for impacted messages was 3 hours and 20 minutes.

#### Root Cause

Courier uses feature flags to safely roll out new features. Due to a misconfigured flag, a larger-than-expected volume of send requests was included in a validation experiment meant to verify that a refactor of the send pipeline was safe to roll out. These requests added significant load on key stages of the send pipeline and caused requests unrelated to the validation experiment to queue.

#### Remediation

Courier incrementally scaled up processing capacity in the send pipeline to work through the large accumulated backlog of messages. Additionally, a hotfix release was pushed to production to drop validation messages that had already entered the send pipeline.

#### Follow up actions

* Courier has established a process to better validate flag configuration in the future, and has changed its feature flag helper library to make it less error-prone to use.
* Courier has created an incident playbook to guide on-call engineers through options to quickly scale up message processing in the send pipeline.
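
To illustrate the kind of guard the first follow-up action describes, here is a minimal sketch of a flag helper that refuses out-of-range rollout values before any request is sampled into an experiment. It is not Courier's actual helper library: the function names, the `MAX_EXPERIMENT_PERCENT` ceiling, and the assumption that the flag is a percentage are all hypothetical.

```python
import hashlib

# Hypothetical ceiling: validation experiments may sample at most 1% of requests.
MAX_EXPERIMENT_PERCENT = 1.0


def validate_rollout_percent(percent, ceiling=MAX_EXPERIMENT_PERCENT):
    """Return the percentage as a float, or raise if it is not a number in [0, ceiling]."""
    if isinstance(percent, bool) or not isinstance(percent, (int, float)):
        raise TypeError(f"rollout percent must be a number, got {type(percent).__name__}")
    if not 0.0 <= percent <= ceiling:
        raise ValueError(f"rollout percent {percent} is outside [0, {ceiling}]")
    return float(percent)


def in_experiment(request_id, percent):
    """Deterministically decide whether a request is sampled into the experiment.

    The request ID is hashed into one of 10,000 buckets, so the same request
    always gets the same answer and 1% corresponds to 100 buckets.
    """
    percent = validate_rollout_percent(percent)
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

With a check like this, a flag value entered as `50` where `0.5` was intended raises an error at evaluation time rather than silently routing half of all send requests into the experiment.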