Latitude.sh incident

Incident with Databases and Filesystems

Critical · Resolved

Latitude.sh experienced a critical incident on July 29, 2025 affecting its Databases and Filesystem components, lasting 2d 3h. The incident has been resolved; the full update timeline is below.

Started
Jul 29, 2025, 03:07 PM UTC
Resolved
Jul 31, 2025, 06:52 PM UTC
Duration
2d 3h
Detected by Pingoru
Jul 29, 2025, 03:07 PM UTC

Affected components

Databases, Filesystem

Update timeline

  1. investigating Jul 29, 2025, 03:07 PM UTC

    We have identified an issue with our Databases cluster, which has impacted the application availability. Our team is actively working on restoring these services.

  2. identified Jul 30, 2025, 02:05 AM UTC

    The issue has been identified and a fix is being implemented.

  3. resolved Jul 31, 2025, 06:52 PM UTC

    This incident has been resolved.

  4. postmortem Jul 31, 2025, 06:56 PM UTC

    **Impact:** All customer databases in the Dallas region experienced unavailability during this window. No data loss occurred.

    On July 28th, 2025, the Latitude.sh Databases cluster in our Dallas region experienced a critical failure due to a broader site-level outage. The incident led to the loss of the cluster's internal state, rendering it unsalvageable. As a result, all customer databases in this region became unavailable.

    Our engineering team immediately initiated recovery efforts. We provisioned a new environment, redeployed the database control plane, and restored each customer database from off-site backups. After 35 hours of continuous work, services were fully restored. All customer data is safe and accessible. However, configuration-level metadata (such as trusted sources) could not be recovered and must be manually recreated.

    The failure originated from the control plane node pool of the Dallas cluster. The resulting corruption of the cluster's internal state made it impossible to safely recover or rejoin the remaining nodes. Compounding the issue was the absence of recent control plane snapshots, which limited restoration options. We are conducting a full forensic analysis to determine the root cause of the corruption and evaluate the failover mechanisms that were expected to mitigate such an event.

    **Impact**

    * **All databases in Dallas were unavailable for the period**
    * **All customer data was preserved** and restored to new clusters
    * **Trusted sources (firewall rules)** were lost and must be reconfigured by customers

    **Immediate Actions Taken**

    * Isolated the failed cluster to prevent further damage
    * Deployed a new cluster in Dallas
    * Restored customer databases from off-site backups
    * Validated database integrity and access for each tenant

    **Customer Actions Required**

    Since all databases were recreated in a new environment, customers **must take the following steps**:

    1. **Update Database Connection URIs and Credentials.** Your database URI and credentials have changed. Please check your dashboard or reach out to support to retrieve your new connection details.
    2. **Recreate Trusted Sources.** Any previously configured trusted sources (firewall allowlists) were not recoverable and need to be manually re-added.
    3. **Review Application Integrations.** If you have automated services or applications depending on the old URI or IPs, ensure those are updated to avoid connectivity issues (see the connectivity-check sketch after this list).
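
    The new URIs and credentials come from the Latitude.sh dashboard or support; the sketch below is not part of the vendor's guidance, just a minimal way to confirm a new connection URI is reachable from an application host before cutting traffic over. The `DATABASE_URL` variable name and the default-port table are assumptions for illustration.

    ```python
    # Minimal sketch: verify that a new database URI is reachable before switching over.
    # Assumes the new URI is exported as DATABASE_URL (hypothetical variable name);
    # the actual URI and credentials come from the Latitude.sh dashboard or support.
    import os
    import socket
    import sys
    from urllib.parse import urlsplit

    # Assumed defaults for common engines when the URI omits an explicit port.
    DEFAULT_PORTS = {"postgres": 5432, "postgresql": 5432, "mysql": 3306, "redis": 6379}

    def check_database_uri(uri: str, timeout: float = 5.0) -> bool:
        """Return True if a TCP connection to the URI's host/port succeeds."""
        parts = urlsplit(uri)
        host = parts.hostname
        port = parts.port or DEFAULT_PORTS.get(parts.scheme, 5432)
        if not host:
            raise ValueError("URI has no hostname; re-check the value copied from the dashboard")
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:
            print(f"Cannot reach {host}:{port}: {exc}", file=sys.stderr)
            return False

    if __name__ == "__main__":
        new_uri = os.environ.get("DATABASE_URL", "")
        if not new_uri:
            sys.exit("Set DATABASE_URL to the new connection URI before running this check.")
        sys.exit(0 if check_database_uri(new_uri) else 1)
    ```

    Note that a TCP check only confirms network reachability: it will fail until the client's IP is re-added as a trusted source (step 2 above), and it does not validate the new credentials, so a full application-level connection test is still needed afterwards.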