DANAconnect incident
DANAconnect experienced an incident on November 5, 2024, affecting the Email platform, Director - Workflow Orchestration, and 1 more component, lasting 2h 9m. The incident has been resolved; the full update timeline is below.
Affected components
- Email platform
- Director - Workflow Orchestration
- 1 more component
Update timeline
- investigating Nov 05, 2024, 06:06 PM UTC
We are currently investigating this issue.
- identified Nov 05, 2024, 07:30 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Nov 05, 2024, 08:03 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Nov 05, 2024, 08:16 PM UTC
This incident has been resolved.
- postmortem Nov 06, 2024, 03:40 PM UTC
Download the report in Spanish: [https://drive.google.com/file/d/1IiEQArry8AjBU4M-g6TPHQ95j-iK7EXo/view?usp=sharing](https://drive.google.com/file/d/1IiEQArry8AjBU4M-g6TPHQ95j-iK7EXo/view?usp=sharing)

**Incident Report**

**Incident ID:** CYMSLP-126-241105
**Date:** 11/05/2024

### **INCIDENT DESCRIPTION**

**Incident reported by:** DevOps
**Date and time of the incident:** 11/05/2024 - 12:30 PM (GMT-4)
**Time elapsed from when the DevOps team noticed the incident until its resolution:** 12:30 PM - 4:00 PM (GMT-4), **3 hours 30 minutes**

**Details about the services affected by the incident:**

* High degradation in communication delivery speed. There was prolonged queuing in the delivery of communications across various channels, including email, SMS, and push notifications.
* Both mass campaigns and transactional notifications were affected.
* A group of platform users reported that communication processing was halted.

**Severity of the incident (Critical, High, Medium, Low):** Critical. 3 hours 30 minutes elapsed from the initial incident report to resolution.

**Frequency of this type of incident:** Very low: first occurrence in 15 years.

### **CAUSE OF THE INCIDENT**

**Details about the vulnerability that caused the incident:**

A very high number of write accesses was detected on the tables that manage the error logs of the communication orchestrator. An error-log write occurs when a message cannot be processed for some reason; the most common reasons are that the recipient's address is invalid or that the processing route is undefined. Although logging errors is a normal process within the platform, the massive volume of writes caused a bottleneck on the table, halting message processing.

**Example of contained query:** *(screenshot)*

**Degradation of insertion times in the error log:** *(screenshot)*

It was determined that the failure originated from the simultaneous activation of 8 conversations, totaling approximately 8 million messages with destination addresses in an invalid or empty format.
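To illustrate the failure mode (a hypothetical sketch, not DANAconnect's implementation): when every failing message triggers its own synchronous insert into a shared error-log table, a burst of millions of invalid recipients becomes millions of row-level writes. A batched error logger with an on-demand disable switch, using SQLite here purely for illustration, avoids that per-message write pattern:

```python
import sqlite3
from typing import List, Tuple

class BatchedErrorLog:
    """Buffer error-log rows and flush them in bulk, so a burst of failing
    messages produces a few batch inserts instead of one write per message.
    (Hypothetical sketch; table and class names are illustrative only.)"""

    def __init__(self, conn: sqlite3.Connection,
                 batch_size: int = 1000, enabled: bool = True):
        self.conn = conn
        self.batch_size = batch_size
        self.enabled = enabled  # on-demand switch to disable error-log writes
        self._buffer: List[Tuple[str, str]] = []
        conn.execute(
            "CREATE TABLE IF NOT EXISTS error_log (recipient TEXT, reason TEXT)"
        )

    def record(self, recipient: str, reason: str) -> None:
        if not self.enabled:  # writes disabled, e.g. during an incident
            return
        self._buffer.append((recipient, reason))
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self.conn.executemany(
                "INSERT INTO error_log (recipient, reason) VALUES (?, ?)",
                self._buffer,
            )
            self.conn.commit()
            self._buffer.clear()

conn = sqlite3.connect(":memory:")
log = BatchedErrorLog(conn, batch_size=500)
for i in range(1200):  # simulate a burst of invalid recipients
    log.record(f"bad-address-{i}", "invalid recipient format")
log.flush()
rows = conn.execute("SELECT COUNT(*) FROM error_log").fetchone()[0]
```

The `enabled` flag plays the role of the on-demand parameter the prevention section proposes: during a burst, operators can stop error-log writes entirely without stopping message processing.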
### **CORRECTIVE ACTIONS**

1. 12:30 PM (GMT-4): Volume of transactions to process within the orchestrator verified by DevOps.
2. 12:35 PM (GMT-4): Transaction slowness and wait times in the database verified by DevOps.
3. 12:45 PM (GMT-4): DANAconnect team was notified via Slack on the Incident channel.
4. 12:50 PM (GMT-4): The injection of new activations that could be affected was paused.
5. 1:06 PM (GMT-4): All clients were notified about the platform incident, and a notice was posted on [https://status.danaconnect.com](https://status.danaconnect.com).
6. 1:10 PM (GMT-4): A security snapshot of the database cluster was generated.
7. 1:56 PM (GMT-4): It was determined that the origin of the platform degradation was write access to the error log.
8. 2:51 PM (GMT-4): The conversations causing the massive generation of error logs were located.
9. 3:00 PM (GMT-4): The conversations that initiated the incident were stopped to prevent further writes to the error log.
10. 3:10 PM (GMT-4): The corrective action did not improve platform performance to the expected level.
11. 3:30 PM (GMT-4): A general truncation/cleanup was initiated on the queue accumulating the messages that caused the incident.
12. 3:50 PM (GMT-4): All API/WS and orchestrator services were restarted.
13. 3:55 PM (GMT-4): An internal testing cycle by the DevOps team was initiated.
14. About 4:00 PM (GMT-4): The resolution of the incident was confirmed to the entire DANAconnect team internally via Slack (Incident channel).

**Details about the solution/patch implemented to resolve the incident:**

* Cleaned up the messages that caused the failure at the processing-queue level.
* Test reports clearly indicate that the implemented solution is functioning.
**Database during transaction processing / incident errors** - 12:30 PM (GMT-4)
_Note: The time on the graphs is local to AWS._ *(screenshot)*

**Database after queue cleanup:** Processing was fully restored - 4:00 PM (GMT-4)
_Note: The time on the graphs is local to AWS._ *(screenshot)*

### **ACTIONS TO PREVENT THE INCIDENT FROM RECURRING**

**Specialized Monitoring:**

* Configure a special dashboard and alerts for excessive access to the error-log table. This will alert the DevOps team to analyze the need for preventive or corrective action.
* Monitoring dashboard for conversations generating write errors: this will quickly identify which conversations may be generating massive bursts of writes to the error log.

**Solution Changes:**

* We are considering a parameter that can be adjusted on demand to disable writing to the error log.

**Documentation:**

* Added this new type of case and its solution to the Incident Response Plan to provide a quick and effective response.
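The monitoring idea above (flag conversations that generate bursts of error-log writes) reduces to a threshold count over the error events in a recent window. A minimal sketch, with hypothetical event IDs and threshold, of how such an alert check might work:

```python
from collections import Counter
from typing import Dict, Iterable

def find_noisy_conversations(error_events: Iterable[str],
                             threshold: int) -> Dict[str, int]:
    """Return conversation IDs whose error-log write count in the current
    window meets or exceeds the alert threshold. (Hypothetical sketch of
    the dashboard logic, not production code.)"""
    counts = Counter(error_events)
    return {cid: n for cid, n in counts.items() if n >= threshold}

# Simulated window: one conversation misfiring at scale, one with normal noise.
events = ["conv-A"] * 5000 + ["conv-B"] * 12
noisy = find_noisy_conversations(events, threshold=1000)
# noisy == {"conv-A": 5000}; conv-B stays below the alert threshold
```

In practice the event stream would come from the error-log table or its write metrics, and crossing the threshold would page DevOps before the writes degrade the whole orchestrator.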