HyperTrack experienced a major incident on May 23, 2025 affecting Orders, Dashboard, and one other component, lasting 85 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating May 23, 2025, 06:25 PM UTC
We are currently investigating the issue and working to resolve it.
- investigating May 23, 2025, 06:25 PM UTC
We are continuing to investigate. The issue emerged at 17:20 UTC.
- resolved May 23, 2025, 07:11 PM UTC
The issues were resolved at 18:45 UTC. The team is gathering data for the postmortem and defining action items to prevent future degradations.
- postmortem May 27, 2025, 07:15 PM UTC
# **Postmortem: System-Wide Outage Due to Database Degradation**

**Incident Date:** May 23, 2025
**Time to Resolution:** 85 minutes
**Status:** Resolved
**Severity:** Critical (P0)

### **Summary**

On May 23, 2025, our platform experienced a widespread outage caused by degraded performance in our database infrastructure; a set of read replicas was overloaded during this period. The degradation resulted in elevated error rates and unavailability across multiple APIs, including Orders, Workers, Places, and SDK-related services. The issue was fully resolved within 85 minutes. We understand how critical our services are to your operations and sincerely apologize for the disruption.

### **What Happened**

A query pattern targeting a key Orders API table failed to use a necessary index. This led to full table scans that overloaded some of our reader instances, causing several core APIs to fail or experience extreme latency. (A sketch of this failure mode appears after the postmortem.)

### **Impact**

* Customers experienced timeouts or errors when accessing the Orders, Workers, and Places APIs
* Monitoring and dashboard functionality was temporarily unavailable

### **What We Did**

* Identified the problematic query
* Deployed a hotfix to ensure proper index usage
* Applied a secondary patch to reduce load when workers were not actively tracking
* Restarted degraded infrastructure and monitored stabilization
* Performed a full incident review across impacted components

### **Remediation and Next Steps**

We are taking the following actions to ensure this does not happen again:

* **Automated slow-query detection**: We’re enhancing our review pipeline with weekly audits and real-time alerting (see the audit sketch below).
* **Improved infrastructure alarms**: CPU and query performance alarms will provide earlier visibility into degradation (see the alarm sketch below).

### **Final Thoughts**

We are committed to providing a stable and resilient platform. This incident has highlighted areas we must improve, and we’re taking swift action to reinforce our architecture. Thank you for your trust and patience.
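To make the failure mode concrete, here is a minimal sketch of how a predicate can defeat an index and how the resulting full-table scan can be detected with `EXPLAIN`. It assumes a PostgreSQL-compatible database and psycopg2; the table (`orders`), column (`device_id`), and queries are hypothetical, since the actual schema and query were not disclosed in the postmortem.

```python
# A minimal sketch of the failure mode, assuming a PostgreSQL-compatible
# database and psycopg2. The table ("orders"), column ("device_id"), and
# connection string are hypothetical.
import psycopg2

def uses_seq_scan(conn, query: str, params: tuple) -> bool:
    """Return True if the planner picks a sequential (full-table) scan."""
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (FORMAT JSON) " + query, params)
        stack = [cur.fetchone()[0][0]["Plan"]]  # root of the plan tree
        while stack:  # walk the tree looking for a Seq Scan node
            node = stack.pop()
            if node["Node Type"] == "Seq Scan":
                return True
            stack.extend(node.get("Plans", []))
    return False

conn = psycopg2.connect("dbname=example")  # hypothetical connection

# Wrapping an indexed column in a function is one common way a query
# "fails to use a necessary index": a plain B-tree index on device_id
# cannot serve this predicate, so the planner falls back to a full scan.
bad = "SELECT * FROM orders WHERE lower(device_id) = %s"

# Rewriting the predicate to match the indexed expression restores the
# index scan -- the shape of the hotfix, not the actual fix deployed.
good = "SELECT * FROM orders WHERE device_id = %s"

for label, q in (("bad", bad), ("good", good)):
    print(label, "-> seq scan?", uses_seq_scan(conn, q, ("abc123",)))
```

Note that whether the planner actually chooses the index depends on an index existing on `device_id` and on table statistics; on a near-empty table a sequential scan may win either way.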
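The weekly slow-query audit mentioned under remediation could be as simple as polling `pg_stat_statements`. This is a sketch under the assumption of a PostgreSQL-compatible database with that extension enabled; the 500 ms threshold and the print-based "alert" are placeholders for whatever real alerting pipeline is in place.

```python
# A minimal sketch of a slow-query audit over pg_stat_statements,
# assuming a PostgreSQL-compatible database with the extension enabled.
# The threshold and print-based "alert" are placeholders.
import psycopg2

SLOW_MS = 500  # hypothetical cutoff for "slow"

AUDIT_SQL = """
    SELECT query, calls, mean_exec_time   -- named "mean_time" before PG 13
    FROM pg_stat_statements
    WHERE mean_exec_time > %s
    ORDER BY mean_exec_time DESC
    LIMIT 20
"""

def audit_slow_queries(dsn: str) -> None:
    """Report the slowest statements; run weekly or on a tight loop."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(AUDIT_SQL, (SLOW_MS,))
        for query, calls, mean_ms in cur.fetchall():
            # A real pipeline would page on-call or post to an alert
            # channel; printing stands in for that here.
            print(f"{mean_ms:8.1f} ms avg x {calls} calls: {query[:80]}")

audit_slow_queries("dbname=example")  # hypothetical DSN
```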
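For the infrastructure alarms, the sketch below shows one possible realization of an earlier-warning CPU alarm. It assumes the read replicas are AWS RDS/Aurora instances monitored by CloudWatch, which the postmortem does not confirm; the alarm name, instance identifier, SNS topic, and thresholds are all hypothetical.

```python
# A minimal sketch of an earlier-warning CPU alarm, assuming AWS
# RDS/Aurora readers and CloudWatch. All names, the SNS topic ARN,
# and the thresholds are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-db-reader-cpu-high",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "orders-reader-1"}],
    Statistic="Average",
    Period=60,                # one-minute samples
    EvaluationPeriods=3,      # must breach for 3 consecutive minutes
    Threshold=80.0,           # percent CPU
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],
)
```

A sustained-breach alarm (three one-minute periods here) trades a few minutes of detection latency for far fewer false pages than alerting on a single spike.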