Sekoia FRA1 incident

Storage cluster disruption

Sekoia FRA1 experienced a critical incident on September 9, 2025, affecting Event storage and lasting 1d 1h. The incident has been resolved; the full update timeline is below.

Started: Sep 09, 2025, 08:13 AM UTC
Resolved: Sep 10, 2025, 10:03 AM UTC
Duration: 1d 1h
Detected by Pingoru: Sep 09, 2025, 08:13 AM UTC

Affected components

Event storage

Update timeline

investigating Sep 09, 2025, 08:13 AM UTC

We are currently experiencing an incident affecting the event storage cluster. This is disrupting indexing and event search jobs. Our engineering team is already investigating the issue. We will provide an update as soon as we have more information. We apologize for any inconvenience this may cause.

identified Sep 09, 2025, 08:31 AM UTC

Our engineering team is relieving pressure on the event storage cluster by sequentially restarting the services on all affected machines. This procedure is intended to let the storage cluster gradually reintegrate the services. We will continue to monitor the situation closely and will provide updates as necessary. Thank you for your patience.

identified Sep 09, 2025, 10:37 AM UTC

Our engineering team is still restarting all impacted machines, and the storage cluster is coming back up slowly. We will update you once we restart the indexing process. Thank you for your patience.

identified Sep 09, 2025, 11:52 AM UTC

Our storage cluster is currently being stabilized, and our team is applying operations to ensure reliability. In the meantime, we are preparing to restore real-time processing later this afternoon, while safely storing the data backlog accumulated this morning. The backlog will be processed later today, once overall traffic is lower. No data loss is expected. Thank you for your patience.

monitoring Sep 09, 2025, 01:27 PM UTC

We have resumed near-real-time data processing, and the backlog has been safely stored for processing later today. There is still a slight delay in event processing, which should clear during the afternoon. Alerts are now being raised on time, and events from 13:40 CEST (11:40 UTC) onward should be visible on the events page. Stabilization work on the storage cluster is still ongoing. We will continue to provide updates on the situation. We apologize for the inconvenience.

monitoring Sep 09, 2025, 05:59 PM UTC

With overall traffic now lower, we have started indexing the backlog of events. This process will continue overnight and is expected to be completed before tomorrow morning. Real-time traffic continues to be indexed without interruption, and no data has been lost during this incident. Thank you for your patience throughout this long incident.

resolved Sep 10, 2025, 10:03 AM UTC

The entire backlog was successfully processed overnight, and the platform has been operating normally since this morning. We are actively implementing preventive measures to avoid a recurrence of this incident, and a detailed post-mortem will be shared soon. We apologize for the inconvenience and thank you for your patience.