Mailprotector experienced a notice-level incident on December 5, 2022. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Dec 05, 2022, 08:53 PM UTC
Additional resources were added to the filtering infrastructure at approximately 4:30 pm ET on Thursday, December 1, 2022. Roughly 11% of emails received through CloudFilter either bounced or were deferred during:

* 12/1/22 4:30 pm to 6:00 pm
* 12/2/22 5:30 am to 2:00 pm

The additional resources had misconfigurations that caused a failure to deliver email to the next hop in the CloudFilter mail flow. The resources were added to resolve a resource constraint pattern observed at the top of the hour during early business hours.

Email messages that entered the queue on the new resources remained in Postfix's SMTP retry cycle until they were delivered or reached the bounce threshold. Because the bounce threshold was 12 hours, bounces did not appear until long after the messages were queued, which delayed detection of the problem. We estimate that 54,000 messages were bounced across the above time frames.
- postmortem Dec 05, 2022, 08:53 PM UTC
Additional resources were added to the filtering infrastructure at approximately 4:30 pm ET on Thursday, December 1, 2022. Roughly 11% of emails received through CloudFilter either bounced or were deferred during:

* 12/1/22 4:30 pm to 6:00 pm
* 12/2/22 5:30 am to 2:00 pm

The additional resources had misconfigurations that caused a failure to deliver email to the next hop in the CloudFilter mail flow. The resources were added to resolve a resource constraint pattern observed at the top of the hour during early business hours.

Email messages that entered the queue on the new resources remained in Postfix's SMTP retry cycle until they were delivered or reached the bounce threshold. Because the bounce threshold was 12 hours, bounces did not appear until long after the messages were queued, which delayed detection of the problem. We estimate that 54,000 messages were bounced across the above time frames.

**Detection**

The incident was detected when Partner Success researched tickets that seemed to correlate with mail delivery issues through the newly deployed resources. Partner Success escalated the issues to the operations engineer, who gathered information and escalated it to the on-call engineer.

**Recovery**

The issue was resolved by removing the new resources from production service. However, several attempts were made to resolve the issue at different points throughout the outage window. Ultimately, the issue was resolved by adding new external IP addresses to Postfix configurations and the transport servers' security group.

**Next Steps**

The incident exposed gaps in the documentation of legacy mail infrastructure and in the processes for rolling out changes to that infrastructure.

Several changes in the process will be implemented, including but not limited to:

* Comprehensive run book for adding new resources to the CloudFilter cluster
* Account for identified "blind spots"
* New staging environments for validating changes
* Additional key metrics for post-deployment observation
* A review of the data collected during this incident is ongoing
* Determine metrics that were missing
* Different alerting or notifications to get ahead of partner reports from tickets

The incident and affected resources are resolved. However, the team will continue implementing processes to prevent a repeat of mail flow performance problems. The effort does not end with resolving this incident; it marks a renewed focus on the iterative improvement of stable infrastructure management.

**Anticipated FAQs**

* Can the bounced emails be resent?
    * No. Once an email bounces, it is removed from the queue. SMTP (Simple Mail Transfer Protocol) does not keep an email after it is bounced.
* Can I receive a list of emails that were bounced?
    * Unfortunately, no. The logs are not organized in a way that allows that information to be pulled together. The SMTP response for an individual message appears in its log timeline in the Console, but looking it up is a manual process.
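
**Illustrative sketch**

The Python sketch below illustrates two points from the postmortem: why a 12-hour bounce threshold hides a delivery failure for hours, and how a deferral-rate check of the kind mentioned under Next Steps could surface it sooner. It is a minimal example under stated assumptions, not Mailprotector's actual monitoring. The function names and the 5% alert threshold are hypothetical; only the 12-hour bounce threshold and the roughly 11% deferral figure come from the report above.

```python
from datetime import datetime, timedelta

# From the postmortem: messages retried for 12 hours before bouncing.
BOUNCE_THRESHOLD = timedelta(hours=12)


def earliest_bounce(queued_at: datetime) -> datetime:
    """Return when a message that never delivers would first bounce."""
    return queued_at + BOUNCE_THRESHOLD


def deferral_rate(deferred: int, received: int) -> float:
    """Return the fraction of received messages that were deferred."""
    if received == 0:
        return 0.0
    return deferred / received


def should_alert(deferred: int, received: int, threshold: float = 0.05) -> bool:
    """Flag a window whose deferral rate meets a hypothetical threshold."""
    return deferral_rate(deferred, received) >= threshold


if __name__ == "__main__":
    # A message queued when the new resources went live (4:30 pm on 12/1)
    # would not bounce until 4:30 am on 12/2.
    queued = datetime(2022, 12, 1, 16, 30)
    print(earliest_bounce(queued))

    # A window with 11 of 100 messages deferred exceeds the 5% threshold.
    print(should_alert(deferred=11, received=100))
```

Watching deferrals rather than bounces is the design choice here: deferrals begin as soon as delivery to the next hop fails, while bounces only appear after the full retry window has elapsed.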