DANAconnect incident
DANAconnect experienced an incident on November 5, 2024, affecting the Email platform, Director - Workflow Orchestration, and 1 more component, lasting 2h 9m. The incident has been resolved; the full update timeline is below.
Affected components
- Email platform
- Director - Workflow Orchestration
- 1 more component
Update timeline
- investigating Nov 05, 2024, 06:06 PM UTC
We are currently investigating this issue.
- identified Nov 05, 2024, 07:30 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Nov 05, 2024, 08:03 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Nov 05, 2024, 08:16 PM UTC
This incident has been resolved.
- postmortem Nov 06, 2024, 03:40 PM UTC
Download the report in Spanish: [https://drive.google.com/file/d/1IiEQArry8AjBU4M-g6TPHQ95j-iK7EXo/view?usp=sharing](https://drive.google.com/file/d/1IiEQArry8AjBU4M-g6TPHQ95j-iK7EXo/view?usp=sharing)

**Incident Report**

**Incident ID:** CYMSLP-126-241105
**Date:** 11/05/2024

### **INCIDENT DESCRIPTION**

**Incident reported by:** DevOps
**Date and time of the incident:** 11/05/2024 - 12:30 PM (GMT-4)
**Time elapsed from when the DevOps team noticed the incident until its resolution:** 12:30 PM - 4:00 PM (GMT-4), **3 hours 30 minutes**

**Details about the services affected by the incident:**

* High degradation in communication delivery speed. There was prolonged queuing in the delivery of communications across various channels, including email, SMS, and push notifications.
* Both mass campaigns and transactional notifications were affected.
* A group of platform users reported that communication processing was halted.

**Severity of the incident (Critical, High, Medium, Low):** Critical. 3 hours 30 minutes elapsed from the initial incident report to resolution.

**Frequency of this type of incident:** Very low: first occurrence in 15 years.

### **CAUSE OF THE INCIDENT**

**Details about the vulnerability that caused the incident:**

A very high number of write accesses was detected on the tables that manage the error logs of the communication orchestrator. An error-log write occurs when a message cannot be processed for some reason; the most common reasons are that the recipient's address is invalid or that the processing route is undefined. Although logging errors is a normal process within the platform, the massive volume of writes caused a bottleneck on the table, halting message processing.

**Example of contained query:** *(screenshot)*

**Degradation of insertion times in the error log:** *(screenshot)*

It was determined that the failure originated from the simultaneous activation of 8 conversations, totaling approximately 8 million messages with destination addresses in an invalid or empty format.
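To illustrate the failure mode (a hypothetical sketch, not DANAconnect's implementation): when every failing message triggers its own synchronous insert into a shared error-log table, a burst of millions of invalid recipients becomes millions of row-level writes. A batched error logger with an on-demand disable switch, using SQLite here purely for illustration, avoids that per-message write pattern:

```python
import sqlite3
from typing import List, Tuple

class BatchedErrorLog:
    """Buffer error-log rows and flush them in bulk, so a burst of failing
    messages produces a few batch inserts instead of one write per message.
    (Hypothetical sketch; table and class names are illustrative only.)"""

    def __init__(self, conn: sqlite3.Connection,
                 batch_size: int = 1000, enabled: bool = True):
        self.conn = conn
        self.batch_size = batch_size
        self.enabled = enabled  # on-demand switch to disable error-log writes
        self._buffer: List[Tuple[str, str]] = []
        conn.execute(
            "CREATE TABLE IF NOT EXISTS error_log (recipient TEXT, reason TEXT)"
        )

    def record(self, recipient: str, reason: str) -> None:
        if not self.enabled:  # writes disabled, e.g. during an incident
            return
        self._buffer.append((recipient, reason))
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self.conn.executemany(
                "INSERT INTO error_log (recipient, reason) VALUES (?, ?)",
                self._buffer,
            )
            self.conn.commit()
            self._buffer.clear()

conn = sqlite3.connect(":memory:")
log = BatchedErrorLog(conn, batch_size=500)
for i in range(1200):  # simulate a burst of invalid recipients
    log.record(f"bad-address-{i}", "invalid recipient format")
log.flush()
rows = conn.execute("SELECT COUNT(*) FROM error_log").fetchone()[0]
```

The `enabled` flag plays the role of the on-demand parameter the prevention section proposes: during a burst, operators can stop error-log writes entirely without stopping message processing.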
### **CORRECTIVE ACTIONS**

1. 12:30 PM (GMT-4): Volume of transactions to process within the orchestrator verified by DevOps.
2. 12:35 PM (GMT-4): Transaction slowness and wait times in the database verified by DevOps.
3. 12:45 PM (GMT-4): DANAconnect team was notified via Slack on the Incident channel.
4. 12:50 PM (GMT-4): The injection of new activations that could be affected was paused.
5. 1:06 PM (GMT-4): All clients were notified about the platform incident, and a notice was posted on [https://status.danaconnect.com](https://status.danaconnect.com).
6. 1:10 PM (GMT-4): A security snapshot of the database cluster was generated.
7. 1:56 PM (GMT-4): It was determined that the origin of the platform degradation was write access to the error log.
8. 2:51 PM (GMT-4): The conversations causing the massive generation of error logs were located.
9. 3:00 PM (GMT-4): The conversations that initiated the incident were stopped to prevent further writes to the error log.
10. 3:10 PM (GMT-4): The corrective action did not improve platform performance to the expected level.
11. 3:30 PM (GMT-4): A general truncation/cleanup was initiated on the queue accumulating the messages that caused the incident.
12. 3:50 PM (GMT-4): All API/WS and orchestrator services were restarted.
13. 3:55 PM (GMT-4): An internal testing cycle by the DevOps team was initiated.
14. About 4:00 PM (GMT-4): The resolution of the incident was confirmed to the entire DANAconnect team internally via Slack (Incident channel).

**Details about the solution/patch implemented to resolve the incident:**

* Cleaned up the messages that caused the failure at the processing-queue level.
* Test reports clearly indicate that the implemented solution is functioning.
**Database during transaction processing / incident errors** - 12:30 PM (GMT-4)
_Note: The time on the graphs is local to AWS._ *(screenshot)*

**Database after queue cleanup:** Processing was fully restored - 4:00 PM (GMT-4)
_Note: The time on the graphs is local to AWS._ *(screenshot)*

### **ACTIONS TO PREVENT THE INCIDENT FROM RECURRING**

**Specialized Monitoring:**

* Configure a special dashboard and alerts for excessive access to the error-log table. This will alert the DevOps team to analyze the need for preventive or corrective action.
* Monitoring dashboard for conversations generating write errors: this will quickly identify which conversations may be generating massive bursts of writes to the error log.

**Solution Changes:**

* We are considering a parameter that can be adjusted on demand to disable writing to the error log.

**Documentation:**

* Added this new type of case and its solution to the Incident Response Plan to provide a quick and effective response.
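The monitoring idea above (flag conversations that generate bursts of error-log writes) reduces to a threshold count over the error events in a recent window. A minimal sketch, with hypothetical event IDs and threshold, of how such an alert check might work:

```python
from collections import Counter
from typing import Dict, Iterable

def find_noisy_conversations(error_events: Iterable[str],
                             threshold: int) -> Dict[str, int]:
    """Return conversation IDs whose error-log write count in the current
    window meets or exceeds the alert threshold. (Hypothetical sketch of
    the dashboard logic, not production code.)"""
    counts = Counter(error_events)
    return {cid: n for cid, n in counts.items() if n >= threshold}

# Simulated window: one conversation misfiring at scale, one with normal noise.
events = ["conv-A"] * 5000 + ["conv-B"] * 12
noisy = find_noisy_conversations(events, threshold=1000)
# noisy == {"conv-A": 5000}; conv-B stays below the alert threshold
```

In practice the event stream would come from the error-log table or its write metrics, and crossing the threshold would page DevOps before the writes degrade the whole orchestrator.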