Higher Logic incident

Higher Logic Thrive Marketing Enterprise (Real Magnet) - Sending Delays

Higher Logic experienced a notice incident on January 17, 2024 affecting Marketing Enterprise (Real Magnet), lasting 6h 15m. The incident has been resolved; the full update timeline is below.

Started: Jan 17, 2024, 01:42 PM UTC
Resolved: Jan 17, 2024, 07:58 PM UTC
Duration: 6h 15m
Detected by Pingoru: Jan 17, 2024, 01:42 PM UTC

Affected components

Marketing Enterprise (Real Magnet)

Update timeline

investigating Jan 17, 2024, 01:42 PM UTC

We are experiencing message sending delays for all customers. Our Engineering team is investigating the issue. If you have messages that should have been sent by now, please leave them as they are; they will be sent out once the delay has been resolved. This issue will also affect message tracking data. We apologize for the inconvenience and appreciate your patience.

monitoring Jan 17, 2024, 02:56 PM UTC

Our Engineering team identified and corrected the issue and we see messages are starting to catch up. We'll continue to monitor this throughout the day before resolving. We plan to have a root cause analysis (RCA) ready for distribution in the next 3 business days.

resolved Jan 17, 2024, 07:58 PM UTC

We've been monitoring for the last few hours and all is going well. We are marking this issue resolved. We plan to have a root cause analysis (RCA) ready for distribution in the next 3 business days and will post it here.

postmortem Jan 19, 2024, 08:22 PM UTC

**Date: January 17, 2024**

**What Happened:** A subset of customers on Higher Logic Marketing Enterprise (Real Magnet) experienced a disruption in their ability to send emails. The disruption began late Tuesday morning and persisted until 9:21 AM on Wednesday. The problem stemmed from a server issue that prevented this set of customers from sending emails from our platform.

**Timeline (All times in EST):**

1/16/2024

> 1:07 PM – Outbound mail server application (MTA) entered a faulted state and automatically restarted. The application remained in an undetected, faulted state after restarting.
> 3:05 PM – Approximate time of first customer report of issue received.
> 3:11 PM – Issue escalated to Application Engineering team; technical investigation started.
> 5:36 PM – Issue escalated to Development team; investigation continued.
> 6:20 PM – Investigation indicated that email was being delivered; consequently, the issue was escalated to a non-on-call resource.

1/17/2024

> 7:55 AM – Investigation resumed.
> 9:06 AM – Issue escalated to Platform team.
> 9:13 AM – Impacted server restarted.
> 9:21 AM – Inbound and outbound email traffic returned to normal; outbound queued traffic being delivered.

**Root Cause:** The MTA (sending software) application on one mail server crashed and restarted in an unstable state on Tuesday at around 1:07 PM ET.

**Details:** The sending service was restored by restarting the mail server. Once this server was rebooted, we immediately began seeing outbound mail from that server for the impacted clients.

The troubleshooting on January 16 indicated that email was being delivered. However, the email servers were configured in parallel, so email queued on one server would not be delivered until the fault on that server was corrected, while email routed to the other servers continued to be delivered. As a result, email on the single faulted server was queued but not delivered for as long as that server remained in the faulted state.

This incorrect diagnosis delayed the escalation and resolution of the problem. The faulted state was one that had not been previously observed. Monitoring and alerting should have detected the partially failed state and reported the condition more clearly to technical staff.

**Corrective Actions:**

* Working with the MTA software vendor to determine the cause of the fault and any remediation necessary to prevent future faults.
* Further investigation and training on email delivery via the MTA to better understand the current condition of our email delivery.
* Improve monitoring/alerting on the server process to better report email delivery failures.
* Simplify the notification/escalation process and educate staff with clear guidelines to facilitate after-hours escalations.
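
The partial failure described under Details, where mail was being delivered overall while one server's queue was stuck, can be illustrated with a minimal Python sketch. This is not Higher Logic's actual monitoring: the `ServerStats` type, the server names, the counters, and the `min_queue` threshold are all hypothetical, chosen only to show why an aggregate "is email being delivered?" check can pass while a per-server check flags the faulted machine.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Hypothetical per-server counters sampled over one monitoring window."""
    name: str
    queued: int      # messages waiting in this server's outbound queue
    delivered: int   # messages this server delivered during the window


def fleet_is_delivering(stats):
    """Aggregate check: True if any server delivered mail in the window."""
    return sum(s.delivered for s in stats) > 0


def stalled_servers(stats, min_queue=100):
    """Per-server check: names of servers holding a backlog but delivering nothing.

    min_queue is an illustrative threshold, not a recommended value.
    """
    return [s.name for s in stats if s.queued >= min_queue and s.delivered == 0]


# Hypothetical sample: one of three parallel servers is faulted.
window = [
    ServerStats("mta-1", queued=12, delivered=4800),
    ServerStats("mta-2", queued=9500, delivered=0),  # queueing but not sending
    ServerStats("mta-3", queued=20, delivered=5100),
]

print(fleet_is_delivering(window))  # True
print(stalled_servers(window))      # ['mta-2']
```

In this sample the aggregate check reports that email is flowing, which matches what the January 16 troubleshooting observed, while the per-server check isolates the one server whose queue is growing without any deliveries.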