Is Acceldata Data Observability Cloud Down Right Now? Live Acceldata Data Observability Cloud Status & Outages | IsDown incident
Secure Relay Servers Outage
Acceldata Data Observability Cloud experienced a major incident on December 15, 2024. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Dec 17, 2024, 04:18 AM UTC
Secure Relay Servers Unresponsive
- postmortem Jan 07, 2025, 02:25 PM UTC
**Postmortem Report on Production Outage of Secure Relay Server**

**1. Incident Summary**

* **Date & Time:** 15th December 2024 2 PM to 16th December 2024 4 PM
* **Duration:** 26 hours
* **Impact:** All communications from the Control Plane to all Dataplanes for the Data Reliability feature were affected.
* **Severity Level:** High
* **Affected Systems/Services:** Data Reliability (Crawl, Profile, Policy Execution), Cadence

**2. Incident Timeline**

| **Time (IST)** | **Event Description** |
| --- | --- |
| 16th December 2024 3:45 PM | Detection of the issue. |
| 16th December 2024 3:50 PM | Investigation started by the \[Team\]. |
| 16th December 2024 3:55 PM | Identified root cause or signs of root cause. |
| 16th December 2024 4:00 PM | Fix or mitigation steps were applied. |
| 16th December 2024 4:05 PM | Services fully restored. |
| 16th December 2024 4:55 PM | Monitoring and validation of fixes. |

**3. Root Cause Analysis (RCA)**

* **Technical Root Cause:** Secure Relay servers, responsible for securely relaying data and communications between systems, ran out of disk space. With no storage available, critical operations such as writing logs and handling temporary data failed. As a result, the service became unresponsive and stopped relaying data.
* **Systems or Processes Involved:** Secure Relay servers were the only systems involved in the outage.
* **Monitoring Failure:** Monitoring for unhealthy hosts was set up on the Secure Relay Load Balancer, but it was not integrated with Opsgenie. This oversight delayed detection of, and response to, the disk space issue.

**4. Resolution**

* **Steps Taken:** Increased the disk space of the Secure Relay servers.
* **Long-term Fixes:**
  * Added comprehensive monitoring for Secure Relay servers, including service health and disk usage monitoring.
  * Integrated Load Balancer alerts with Opsgenie to ensure proactive detection and escalation.
  * Review all alerts and integrate them with Opsgenie.

**5. Lessons Learned**

* **What Went Well:** Once the issue was detected, it was fixed immediately. All the Secure Relay clients connected successfully.
* **What Didn’t Go Well:** It took many hours to detect the issue.
* **Improvement Areas:** Review the alerts and make sure all of them are integrated with Opsgenie.

**6. Action Items**

| **Action** | **Owner** | **Deadline** | **Status** |
| --- | --- | --- | --- |
| Increase the Storage on Servers | Uday Shanbhag | 16th December 2024 | Closed |
| Add Opsgenie integration for Load Balancer Health alert | Uday Shanbhag | 16th December 2024 | Closed |
| Implement Secure Relay monitoring | Ritesh Mahajan | 16th December 2024 | Closed |
| Implement Alerts for Secure Relay Systems | Ritesh Mahajan | 18th December 2024 | Open |

**7. Conclusion**

The outage on the Secure Relay servers was caused by exhausted disk space, which left services unresponsive and disrupted communication between critical systems. While the immediate fix of increasing disk space resolved the issue, the lack of proactive monitoring and integrated alerting contributed to the delayed response. Moving forward, comprehensive monitoring, proactive alerting, and improved escalation workflows will significantly reduce the likelihood of recurrence. By learning from this incident, we are taking the necessary steps to strengthen our systems' resilience and improve overall operational efficiency.

**Prepared by:** Uday Shanbhag

**Date:** 17th December 2024
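
**Illustrative sketch: disk usage check**

The disk usage monitoring listed under the long-term fixes is not described in detail in this report. The snippet below is only a minimal sketch of what such a check might look like, not the team's actual implementation. The thresholds (80% and 90%), the monitored path, and the `notify` callback are all assumptions made for illustration; in practice `notify` would forward the alert to whatever escalation tool is in use (for example, Opsgenie).

```python
import shutil
from dataclasses import dataclass
from typing import Callable

# Assumed thresholds for illustration; the report does not state real values.
WARN_PCT = 80.0
CRIT_PCT = 90.0


@dataclass
class DiskStatus:
    path: str
    used_pct: float
    level: str  # "ok", "warning", or "critical"


def classify(used_pct: float, warn: float = WARN_PCT, crit: float = CRIT_PCT) -> str:
    """Map a usage percentage to an alert level."""
    if used_pct >= crit:
        return "critical"
    if used_pct >= warn:
        return "warning"
    return "ok"


def check_disk(path: str = "/", usage_fn=shutil.disk_usage) -> DiskStatus:
    """Measure disk usage for the filesystem containing `path`."""
    usage = usage_fn(path)  # exposes .total, .used, .free in bytes
    used_pct = 100.0 * usage.used / usage.total
    return DiskStatus(path, round(used_pct, 1), classify(used_pct))


def check_and_alert(
    path: str,
    notify: Callable[[DiskStatus], None],
    usage_fn=shutil.disk_usage,
) -> DiskStatus:
    """Run the check and call `notify` (a hypothetical hook) when not OK."""
    status = check_disk(path, usage_fn)
    if status.level != "ok":
        notify(status)
    return status


if __name__ == "__main__":
    # Hypothetical notifier: prints instead of paging an on-call engineer.
    check_and_alert(
        "/",
        notify=lambda s: print(f"[{s.level}] {s.path} is {s.used_pct}% full"),
    )
```

Run periodically (for example from a scheduler), a check of this kind would raise a warning well before the disk fills completely, which is the gap the Monitoring Failure item describes.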