Spacelift incident

Degradation of the GraphQL API


Spacelift experienced a notice-level incident on June 8, 2025, lasting 18 minutes. The incident has been resolved; the full update timeline is below.

Started
Jun 08, 2025, 06:00 AM UTC
Resolved
Jun 08, 2025, 06:18 AM UTC
Duration
18m
Detected by Pingoru
Jun 08, 2025, 06:00 AM UTC

Update timeline

  1. resolved Jun 11, 2025, 03:31 PM UTC

    Between 06:18 and 06:38 our GraphQL API experienced increased latency and a high error rate, causing most requests to fail. This was ultimately due to an issue with the autovacuum process of our Postgres database. More details are available in the post-mortem.

  2. postmortem Jun 11, 2025, 03:32 PM UTC

    **Summary**

    Between June 8 and June 9, 2025, multiple service disruptions occurred due to increased latency and request failures across HTTP APIs. These incidents were traced back to the behavior of the PostgreSQL autovacuum process and were addressed by operational mitigations and subsequent investigations.

    **Incident Timeline**

    June 8, 2025:

    * 05:18 – 05:38 UTC: Service degradation.
    * 21:20 – 21:40 UTC: Another period of degradation.

    June 9, 2025:

    * 02:44 – 03:05 UTC: Additional disruption.
    * Post-incident: Investigation and fix deployed.

    **Root Cause**

    An unexpected PostgreSQL **autovacuum behavior** (specifically, autovacuum acquiring an **ACCESS EXCLUSIVE** lock, which is highly unusual) caused a series of latency spikes and request failures across our system. The trigger was a **data retention** job rolled out in early May 2025, which deleted large volumes of old data from a key table responsible for authorization logic.

    At first, the system appeared stable. However, **30 days after the initial clean-up job was completed**, PostgreSQL's autovacuum initiated heap truncation, an optimization that reclaims empty pages at the end of a table. This delay occurred because pages only become truncatable once they become empty, and we have a 30-day retention on the data.

    Crucially, heap truncation requires an ACCESS EXCLUSIVE lock, the strictest lock type, which blocks all reads and writes to the table. This behavior was unexpected, as autovacuum is typically non-blocking. **By default, PostgreSQL enables heap truncation**, but it rarely occurs and is generally harmless. It only becomes problematic on very busy tables, like ours, where even a brief exclusive lock can significantly disrupt live traffic.

    **Mitigations**

    * **Immediate**: Failovers were manually initiated to recover service during each incident.
    * **Permanent**: Heap truncation has been disabled on the affected table, and we are re-evaluating our data retention approach for tables like this to avoid future locking risks during maintenance operations.

    **Conclusion**

    These incidents were caused by an unintended interaction between large-scale data retention and PostgreSQL's autovacuum behavior on critical tables. Remediation steps are now in place to prevent recurrence, and process adjustments are underway to better assess such risks before future retention rollouts.
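    The permanent mitigation described in the post-mortem, disabling heap truncation on a single busy table, corresponds to PostgreSQL's per-table `vacuum_truncate` storage parameter (available since PostgreSQL 12). A minimal sketch of what such a change could look like; the table name `authorization_entries` is illustrative, as the post-mortem does not name the affected table:

    ```sql
    -- Disable autovacuum heap truncation on one busy table only
    -- (table name is hypothetical; requires PostgreSQL 12+).
    ALTER TABLE authorization_entries SET (vacuum_truncate = off);

    -- A manual VACUUM can likewise skip truncation explicitly:
    VACUUM (TRUNCATE FALSE) authorization_entries;

    -- To spot an in-progress truncation, look for an autovacuum worker
    -- holding an ACCESS EXCLUSIVE lock:
    SELECT a.pid, a.query, l.mode
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.mode = 'AccessExclusiveLock'
      AND a.query LIKE 'autovacuum:%';
    ```

    Disabling truncation trades a small amount of unreclaimed disk space at the tail of the table for the guarantee that autovacuum never takes a blocking lock on it, which is the trade-off the post-mortem describes.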