Adapty incident

SDK HTTP API partially unavailable due to storage node desynchronization

Adapty experienced a major incident on May 19, 2026, affecting the API and lasting 2 minutes. The incident has been resolved; the full update timeline is below.

Started: May 19, 2026, 04:45 PM UTC
Resolved: May 19, 2026, 04:47 PM UTC
Duration: 2m
Detected by Pingoru: May 19, 2026, 04:45 PM UTC

Affected components

API

Update timeline

investigating May 19, 2026, 04:38 PM UTC

We are currently investigating this issue.

monitoring May 19, 2026, 04:47 PM UTC

We have applied the fix. Storage nodes are in sync now. Response time and HTTP status codes are getting back to normal.

resolved May 19, 2026, 04:52 PM UTC

All SDK HTTP API metrics have returned to normal. As part of our service security enhancement efforts, we accidentally blocked clock synchronization traffic. Clock synchronization is essential for our distributed storage.

postmortem May 19, 2026, 05:36 PM UTC

## Postmortem: SDK API & Dashboard 5xx errors (2026-05-19)

### Summary

On 2026-05-19 between 16:21 and 16:55 UTC, the Adapty SDK API and Dashboard returned elevated 5xx errors for approximately 34 minutes. The primary key-value store in the SDK request path entered a protective "stop-writes" state after its cluster nodes' clocks drifted past the allowed skew threshold; the drift accumulated because outbound NTP traffic from the affected hosts had been blocked at the network firewall layer, leaving the nodes unable to synchronize with upstream time sources. A clock step-correction was applied, the database resumed accepting writes within one minute, and all customer-visible errors cleared by 16:55 UTC.

### Impact

* **Duration:** Approximately 34 minutes (16:21 → 16:55 UTC); the most acute window was 16:21 → 16:46 UTC.
* **Surfaces affected:** SDK endpoints (profile updates, attribution, receipt validation, App Store / Play Store / Stripe / Paddle webhook deliveries), the Dashboard, and 3rd-party integration deliveries.
* **Visible signal:** Elevated 5xx rates at the edge for SDK endpoints; the mobile-API readiness probe briefly reported DOWN between 16:43 and 16:45 UTC.
* **Recovery:** SDK clients that automatically retry recovered most requests on retry. Some webhook deliveries and attribution events that do not retry may require backfill or replay.

### Timeline (UTC)

| Time | Event |
| --- | --- |
| ~16:00 | Database cluster clock skew crosses the configured stop-writes threshold; the protective alert enters its evaluation delay window. |
| 16:21 | Stop-writes condition activates; the database begins rejecting writes; SDK 5xx errors begin at the edge. |
| 16:22–16:32 | 5xx error rate climbs across SDK endpoints. |
| 16:43 | NTP step-correction applied to the affected hosts; cluster clock skew drops from ~38 seconds to under 1 second within ~1 minute. |
| 16:46 | The database resumes accepting writes; SDK 5xx rates begin returning to baseline. |
| 16:55 | All customer-visible errors cleared; incident fully resolved. |

### Root cause and contributing factors

**Root cause:** The SDK request path depends on a clustered key-value store that protects data consistency by halting writes when cluster nodes' clocks disagree by more than a configured threshold. On 2026-05-19, the nodes' clocks had been drifting for an extended period and crossed that threshold around 16:00 UTC, which caused the cluster to halt writes and the SDK API to return 5xx.

**Contributing factors:**

1. **Outbound NTP egress was blocked** at the network firewall layer for the affected hosts, so the time daemon on each node could not reach upstream NTP servers. The block accumulated drift silently over many hours.
2. **No early-warning alert on clock drift.** The only alert in place fired at the protective stop-writes threshold, when the incident was already in progress rather than while it was approaching. There was no graduated alert at a lower skew value that would have provided hours of lead time.
3. **No direct NTP-synchronization health check** at the host level. Time-sync failure was only observable as a downstream effect on cluster skew, not as a primary signal.
4. **Protective stop-writes worked as designed.** The database correctly refused potentially-inconsistent writes, which is preferred over silent data corruption; however, this surfaces to customers as 5xx rather than as a graceful degradation.

### What went well

* The protective stop-writes mechanism behaved as intended and prevented data inconsistency in the affected database.
* Once identified, the fix was surgical and fast: a single clock step-correction returned cluster skew from ~38 seconds to under 1 second in ~1 minute, and the database recovered immediately afterward.

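
To make the relationship between cluster clock skew, the stop-writes threshold, and a graduated early-warning level concrete, here is a minimal Python sketch. It is illustrative only and is not Adapty's implementation or the database's actual logic: the function and class names are hypothetical, and the 5-second and 20-second thresholds are assumed values, since the postmortem does not state the real ones.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SkewThresholds:
    # Assumed values for illustration; the real thresholds are not published.
    warn_seconds: float = 5.0
    stop_writes_seconds: float = 20.0


def cluster_skew(node_clocks: Mapping[str, float]) -> float:
    """Return the largest disagreement, in seconds, between any two node clocks."""
    if not node_clocks:
        raise ValueError("need at least one node clock reading")
    readings = list(node_clocks.values())
    return max(readings) - min(readings)


def classify_skew(skew_seconds: float,
                  thresholds: SkewThresholds = SkewThresholds()) -> str:
    """Map a skew value to 'ok', 'warning', or 'stop-writes'."""
    if skew_seconds >= thresholds.stop_writes_seconds:
        return "stop-writes"
    if skew_seconds >= thresholds.warn_seconds:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical clock readings (seconds) roughly matching the ~38 s skew
    # reported before the step-correction.
    clocks = {"node-a": 1000.0, "node-b": 1038.0, "node-c": 1001.5}
    skew = cluster_skew(clocks)
    print(skew, classify_skew(skew))
```

With a warning level set well below the protective threshold, slowly accumulating drift would pass through the "warning" state long before reaching "stop-writes", which is the lead time that contributing factor 2 describes as missing.
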
### What we will do

**Prevent**: stop the same cause from recurring

* Audit firewall egress rules on every host in the affected tier; confirm outbound NTP (UDP/123) is explicitly allowed to documented upstream time sources; add an active probe that verifies egress and alerts if it regresses.
* Document the time-synchronization topology for the affected database tier and remove ambiguity about expected behavior.

**Detect**: catch the same cause faster next time

* Add an early-warning alert on cluster clock skew at a fraction of the protective threshold, with a multi-minute evaluation window, so drift is visible hours before it can cause stop-writes.
* Add a direct NTP-synchronization health alert (time-daemon offset, stratum, last-sync-age) on every host in the affected tier, independent of the cluster skew metric.

**Mitigate**: limit blast radius if the same cause recurs

* Review whether the affected SDK endpoints can degrade more gracefully when the database is in stop-writes mode (read-only path, cached fallback, queue-and-retry), rather than returning immediate 5xx.

We apologize for the disruption to your applications during this window. If your integration is missing data from this period, please reach out to your usual Adapty contact and reference the 2026-05-19 incident.
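
As an illustration of the host-level NTP-synchronization health check listed under Detect, the following Python sketch evaluates the three signals named there: time-daemon offset, stratum, and last-sync age. It is a sketch under stated assumptions, not the planned implementation. The type and function names are hypothetical, the limits (0.1 s offset, 1 hour sync age) are assumed, and collecting the readings from the time daemon is left out. The stratum rule relies on the NTP convention that synchronized sources report a stratum from 1 to 15, with 16 marking an unsynchronized clock.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimeSyncSample:
    """One reading from a host's time daemon (field names are illustrative)."""
    offset_seconds: float         # local clock minus reference clock
    stratum: int                  # 1-15 when synchronized, 16 when not
    last_sync_age_seconds: float  # time since the last successful sync


def time_sync_problems(sample: TimeSyncSample,
                       max_offset_seconds: float = 0.1,
                       max_sync_age_seconds: float = 3600.0) -> List[str]:
    """Return the names of the signals that look unhealthy (empty if none)."""
    problems: List[str] = []
    if abs(sample.offset_seconds) > max_offset_seconds:
        problems.append("offset")
    if not 1 <= sample.stratum <= 15:
        problems.append("stratum")
    if sample.last_sync_age_seconds > max_sync_age_seconds:
        problems.append("last-sync-age")
    return problems


if __name__ == "__main__":
    healthy = TimeSyncSample(offset_seconds=0.002, stratum=3,
                             last_sync_age_seconds=64.0)
    # Hypothetical reading from a host whose NTP egress has been blocked.
    blocked = TimeSyncSample(offset_seconds=-38.0, stratum=16,
                             last_sync_age_seconds=90000.0)
    print(time_sync_problems(healthy))
    print(time_sync_problems(blocked))
```

Because each signal is checked on the host itself, a blocked egress path would show up as a stale last-sync age even while the offset is still small, independent of the cluster skew metric.
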