Is Acceldata Data Observability Cloud Down Right Now? Live Acceldata Data Observability Cloud Status & Outages | IsDown

Login Failure

Critical · Resolved

Acceldata Data Observability Cloud experienced a critical incident on September 16, 2024, lasting —. The incident has been resolved; the full update timeline is below.

Started: Sep 16, 2024, 06:19 AM UTC
Resolved: Sep 15, 2024, 09:00 AM UTC
Duration: —
Detected by Pingoru: Sep 16, 2024, 06:19 AM UTC

Update timeline

  1. resolved Sep 16, 2024, 06:19 AM UTC

    Login failure was caused by a Catalog Service restart failure.

  2. postmortem Sep 19, 2024, 07:04 AM UTC

    ### Incident Title: Catalog Server Outage

    ### Date of Incident: September 15, 2024, 14:30 IST

    ### Service(s) Impacted: Catalog Server and Authentication Services

    ### Impact on Business:

    * Affected users: all users
    * Affected functionality: login, Data Reliability functionality
    * Duration of downtime and recovery time: 3 hrs
    * Financial or reputational impacts: TBD

    ## 1. Summary of the Incident

    At 13:30 IST on September 15, our production system experienced an outage due to the unintentional application of a minor patch on RDS. This disrupted the Catalog Server for all users until service was restored at 18:00 IST.

    ## 2. Root Cause

    Our production database (RDS) experienced an unexpected restart due to the automatic application of a **minor version upgrade**. This restarted the primary RDS instance, resulting in a **failover** to the standby instance. Consequently, both the Catalog Servers and the Authentication Services entered an erroneous state.

    * The Catalog Server lost its connection to the database, triggering an automatic restart. However, the simultaneous restarts of all Catalog Service instances caused **database entry locking**, which prevented any of the Catalog Service instances from starting successfully.
    * Additionally, a few Authentication Service instances were unable to reconnect to the active database instance until they were manually restarted.
    * Since the UI depends on the Catalog Service, users were unable to log in during this period.

    ## 3. Incident Timeline

    | Time (IST) | Event |
    | --- | --- |
    | 13:30 | Primary RDS instance restarted |
    | 14:30 | Catalog Service restarts raised a priority P3 incident in the alerting system |
    | 17:30 | Engineers started investigating |
    | 17:45 | Root cause identified |
    | 17:55 | Locked database entries cleaned up |
    | 18:00 | All services restored |

    ## 4. Resolution and Recovery

    * Cleaned up the database entries that caused locking
    * Restarted the errored Authentication Service instances

    ## 5. Preventive Actions

    * Add alerts for database failover
    * Remove the dependency between the Catalog Service and the UI for successful login
    * Review the priority of alerts

    ## 6. Action Items

    | Action | Owner | Date |
    | --- | --- | --- |
    | Add alerts for database failover | DevOps Team | Sep 19, 2024 |
    | Review and update alert priorities | DevOps Team | Sep 20, 2024 |
    | Remove the dependency between the Catalog Service and the UI for successful login | Engineering Team | Oct 15, 2024 |
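
The root cause above is the automatic application of an RDS minor version upgrade. The postmortem does not say how (or whether) the team changed its upgrade policy, so the snippet below is only a minimal sketch of the kind of check and change this points at, assuming boto3 as the tooling and using a hypothetical instance identifier `catalog-prod-primary`.

```python
import boto3

rds = boto3.client("rds")

# Hypothetical identifier for the primary RDS instance; the real name is not
# part of the postmortem.
DB_INSTANCE_ID = "catalog-prod-primary"

# Inspect whether minor engine upgrades are applied automatically and when the
# maintenance window falls; an unexpected minor-version patch was the trigger here.
instance = rds.describe_db_instances(DBInstanceIdentifier=DB_INSTANCE_ID)["DBInstances"][0]
print("AutoMinorVersionUpgrade:", instance["AutoMinorVersionUpgrade"])
print("PreferredMaintenanceWindow:", instance["PreferredMaintenanceWindow"])

# One way to take manual control of patch timing: turn the flag off so minor
# upgrades (and the restart they imply) only happen when explicitly scheduled.
rds.modify_db_instance(
    DBInstanceIdentifier=DB_INSTANCE_ID,
    AutoMinorVersionUpgrade=False,
    ApplyImmediately=True,
)
```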
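
The root-cause section also notes that all Catalog Service instances restarting at the same moment locked database entries and kept every instance from coming up. Staggering restarts is one generic mitigation for that pattern; it is not among the postmortem's action items, so the sketch below is purely illustrative, with a hypothetical `connect` callable standing in for the service's real startup path.

```python
import random
import time

def connect_with_jitter(connect, max_jitter_seconds: float = 30.0, attempts: int = 5):
    """Call `connect()` after a random delay, retrying with growing jitter.

    `connect` is a hypothetical callable that opens the service's database
    connection and runs its startup/bootstrap queries.
    """
    for attempt in range(attempts):
        # Random delay so instances restarted at the same time do not hit the
        # shared bootstrap rows simultaneously.
        time.sleep(random.uniform(0, max_jitter_seconds))
        try:
            return connect()
        except Exception:
            if attempt == attempts - 1:
                raise
            max_jitter_seconds *= 2  # widen the window on each failed attempt
```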
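
Among the preventive actions, "add alerts for database failover" maps naturally onto RDS event subscriptions. Again as a hedged sketch assuming boto3 (the SNS topic ARN and instance identifier are hypothetical, and the team's actual alerting stack is not stated), a subscription for failover and availability events on the primary instance could look like this:

```python
import boto3

rds = boto3.client("rds")

# Hypothetical names; the real SNS topic and instance identifiers are not
# given in the postmortem.
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:rds-failover-alerts"
DB_INSTANCE_ID = "catalog-prod-primary"

# Subscribe to failover/failure/availability events on the instance so a
# Multi-AZ failover like the one in this incident notifies on-call immediately.
rds.create_event_subscription(
    SubscriptionName="catalog-prod-failover",
    SnsTopicArn=SNS_TOPIC_ARN,
    SourceType="db-instance",
    EventCategories=["failover", "failure", "availability"],
    SourceIds=[DB_INSTANCE_ID],
    Enabled=True,
)
```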