Ookla incident

Database load balancer outage on Spatialbuzz


Ookla experienced a notice-level incident on October 1, 2025, affecting Downdetector Connect (Spatialbuzz) and lasting 2h 59m. The incident has been resolved; the full update timeline is below.

Started
Oct 01, 2025, 10:36 AM UTC
Resolved
Oct 01, 2025, 01:35 PM UTC
Duration
2h 59m
Detected by Pingoru
Oct 01, 2025, 10:36 AM UTC

Affected components

Downdetector Connect (Spatialbuzz)

Update timeline

  1. investigating Oct 01, 2025, 10:36 AM UTC

    We are currently investigating an issue with our database load balancers that is causing an outage for customers.

  2. identified Oct 01, 2025, 11:45 AM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Oct 01, 2025, 01:07 PM UTC

    A fix has been deployed; some caching issues remain in some login areas.

  4. resolved Oct 01, 2025, 01:35 PM UTC

    This incident has been resolved.

  5. postmortem Oct 01, 2025, 02:50 PM UTC

    ## Incident Summary

    **Date:** October 1, 2025
    **Impact:** Full outage for all European customers
    **Status:** Resolved

    ---

    ### Incident Overview

    On **October 1st at 11:10 BST**, our internal service discovery system in the European region experienced a failure following a low-risk certificate update. This unexpectedly caused dependent systems, including databases and backend services, to become unavailable. As a result, **all customers using our European platform were impacted** and experienced downtime.

    ---

    ### Timeline (BST)

    | Time | Event |
    | --- | --- |
    | 10:58 | Internal service certificates were renewed and deployed (low-risk change) |
    | 11:00 | Backend services began receiving updated certificates |
    | 11:10 | Core coordination service failed to re-establish leadership; investigation began |
    | 11:15 | First customer reports of service errors received |
    | 11:36 | Incident publicly posted on status page |
    | 12:40 | Root cause identified and recovery process initiated |
    | 12:45 | Status page updated |
    | 13:40 | Coordination service quorum restored and systems recovered |
    | 13:50 | Application deployments restarted to refresh service connectivity |
    | 14:07 | Status page updated |
    | 14:35 | Incident marked as resolved |

    ---

    ### What Went Wrong

    An internal certificate update, considered low risk, led to a failure in one of our backend coordination systems. This caused a loss of service discovery across our European platform, impacting all services that rely on this mechanism to communicate, including core databases. The issue required manual intervention to restore control and quorum within the system.

    ---

    ### How We Resolved It

    * Launched a new set of coordination nodes using a known working configuration
    * Temporarily bypassed strict internal security controls to allow recovery
    * Manually brought the core cluster back online and ensured stability (see the quorum-probe sketch after this timeline)
    * Restarted dependent services to refresh their service discovery state
    * Closely monitored system health during and after recovery

    ---

    ### What We're Doing Next

    To prevent similar incidents in the future, we are:

    * **Improving certificate update validation and rollback procedures** (an illustrative pre-deployment check follows below)
    * **Automating recovery workflows** for coordination service failures
    * Beginning to **evaluate alternative service discovery strategies** with reduced operational risk and complexity

    ---

    ### Final Note

    We sincerely apologise for the disruption caused. This incident affected all customers using our European platform, and we take this extremely seriously. We are committed to improving our systems to reduce the risk of similar outages in the future.
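
The recovery steps in the postmortem turn on restoring leadership and quorum in the coordination service. As a rough illustration of the kind of quorum probe an operator might run during such a recovery, here is a minimal sketch assuming a Consul-like system with an HTTP status API; the postmortem does not name the actual service, and the agent address and endpoints below are assumptions, not Ookla's configuration.

```python
# Minimal quorum probe for a Consul-like coordination service.
# ASSUMPTIONS: the postmortem does not name the system; the agent
# address and the /v1/status/* endpoints below are illustrative.
import json
import sys
import urllib.request

AGENT = "http://127.0.0.1:8500"  # hypothetical local agent address


def get(path: str):
    """Fetch a JSON value from the agent's HTTP status API."""
    with urllib.request.urlopen(AGENT + path, timeout=5) as resp:
        return json.load(resp)


def check_quorum() -> None:
    leader = get("/v1/status/leader")  # "" while no leader is elected
    peers = get("/v1/status/peers")    # list of Raft peer addresses
    if not leader:
        sys.exit(f"NO LEADER: {len(peers)} peers visible; quorum lost")
    print(f"leader {leader}; {len(peers)} Raft peers")


if __name__ == "__main__":
    check_quorum()
```

During an event like the 11:10 leadership failure, a probe of this shape distinguishes "cluster reachable but no leader elected" (quorum lost) from an unreachable agent, which fails the HTTP call outright.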
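
On the prevention side, the first follow-up item, improving certificate update validation, could include a pre-deployment check of each renewed certificate before it reaches backend services. The sketch below assumes the Python `cryptography` package; the certificate path, required SAN list, and expiry margin are hypothetical values for illustration only.

```python
# Hypothetical pre-deployment check for a renewed certificate.
# ASSUMPTIONS: the file path, required SANs, and expiry margin are
# illustrative, not values from the incident.
import datetime
import sys

from cryptography import x509
from cryptography.x509.oid import ExtensionOID

MIN_REMAINING = datetime.timedelta(days=7)      # refuse soon-to-expire certs
REQUIRED_SANS = {"db-lb.internal.example.com"}  # hosts the cert must cover


def validate_cert(pem_path: str) -> None:
    with open(pem_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    now = datetime.datetime.utcnow()
    if cert.not_valid_before > now:
        sys.exit(f"cert not yet valid (starts {cert.not_valid_before})")
    if cert.not_valid_after - now < MIN_REMAINING:
        sys.exit(f"cert expires too soon ({cert.not_valid_after})")

    # Raises x509.ExtensionNotFound if the cert has no SAN extension.
    san = cert.extensions.get_extension_for_oid(
        ExtensionOID.SUBJECT_ALTERNATIVE_NAME
    ).value
    missing = REQUIRED_SANS - set(san.get_values_for_type(x509.DNSName))
    if missing:
        sys.exit(f"cert missing required SANs: {missing}")

    print(f"ok: {pem_path} valid until {cert.not_valid_after}")


if __name__ == "__main__":
    validate_cert(sys.argv[1])
```

Wiring such a check into the rollout pipeline, with an automatic rollback when it fails, is one plausible reading of the "validation and rollback procedures" item in the postmortem.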