OpenNode incident

High database load

Major · Resolved

OpenNode experienced a major incident on May 14, 2024. The incident has been resolved; the full update timeline is below.

Started
May 14, 2024, 07:01 PM UTC
Resolved
May 14, 2024, 05:30 AM UTC
Duration
Detected by Pingoru
May 14, 2024, 07:01 PM UTC

Update timeline

  1. resolved May 14, 2024, 07:01 PM UTC

    On May 14, 2024, OpenNode experienced an incident with our production database, resulting in performance degradation and service disruption. The root cause of the issue was identified as a deprecated API loading excessive data, leading to a database overload.

  2. postmortem May 14, 2024, 07:01 PM UTC

    **Summary:** On May 14, 2024, OpenNode experienced an incident with our production database, resulting in performance degradation and service disruption. The root cause of the issue was identified as a deprecated API loading excessive data, leading to a database overload.

    **Timeline of Events:**

    * **10:30 PM PST:** Issue Detected
      * Our monitoring systems alerted us to unusual activity within the production database, indicating potential performance issues.
    * **11:30 PM PST:** Service Restart
      * A decision was made to restart auxiliary services to alleviate any immediate issues and restore stability to the system.
    * **12:30 AM PST:** Database Workload Issue Identified and Fix Deployed
      * Upon further investigation, it was determined that the root cause of the problem was a deprecated API that was loading an excessive amount of data into the database, causing an overload.
      * A fix was promptly developed and deployed to address the issue and mitigate further impact on system performance.
    * **1:30 AM PST:** Monitoring Fix and Stability
      * Additional measures were taken to enhance monitoring systems and ensure the stability of the database environment.
    * **2:00 AM PST:** Issue Confirmed Fixed
      * Following the deployment of the fix, monitoring systems indicated a return to normal database workload and performance levels.
      * Our internal testing confirmed that the issue was resolved, and normal service operations were restored.
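
The postmortem does not describe the fix that was deployed. Purely as an illustration of the failure mode it names (an API loading excessive data), one common guard is a hard cap on rows per request combined with keyset pagination. The sketch below is not OpenNode's code: the `charges` table, the `fetch_page` helper, and the `MAX_PAGE_SIZE` value are all hypothetical, and an in-memory SQLite database stands in for the production database.

```python
import sqlite3

MAX_PAGE_SIZE = 100  # hypothetical hard cap on rows returned per request


def fetch_page(conn, after_id=0, limit=MAX_PAGE_SIZE):
    """Return at most MAX_PAGE_SIZE rows with id > after_id, plus the next cursor."""
    # Clamp the caller-supplied limit so no request can load an unbounded result set.
    limit = max(1, min(int(limit), MAX_PAGE_SIZE))
    rows = conn.execute(
        "SELECT id, amount FROM charges WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    # A full page may have more rows after it, so hand back the last id as a cursor.
    # A short page means the table is exhausted, signalled by None.
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor


def demo():
    # Hypothetical table with 250 rows, standing in for a large production table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE charges (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany(
        "INSERT INTO charges (id, amount) VALUES (?, ?)",
        [(i, i * 10) for i in range(1, 251)],
    )
    # A caller asking for 10,000 rows still receives only MAX_PAGE_SIZE.
    rows, cursor = fetch_page(conn, limit=10_000)
    return len(rows), cursor
```

Clamping on the server side means a legacy caller that never sends a limit, or sends a very large one, still produces a bounded query. Whether such a cap would have applied to the deprecated API in this incident is not something the postmortem states.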