Sekoia FRA1 incident

FRA1 server instability causing workflow slowdowns

Sekoia FRA1 experienced a major incident on July 29, 2025 affecting Detection, lasting 4h 40m. The incident has been resolved; the full update timeline is below.

Started: Jul 29, 2025, 12:08 PM UTC
Resolved: Jul 29, 2025, 04:49 PM UTC
Duration: 4h 40m
Detected by Pingoru: Jul 29, 2025, 12:08 PM UTC

Affected components

Detection

Update timeline

identified Jul 29, 2025, 12:08 PM UTC

We are currently experiencing an issue with a number of servers that are not operational, which is slowing down our workflows. Our team is actively working on stabilizing the affected servers and managing the high memory usage observed on one tier of nodes. In the process, we have temporarily paused certain operations to allow for system recovery and offset commits. Please note that this may result in some event duplication. We will keep you updated as we progress in resolving the issue. Thank you for your patience.

monitoring Jul 29, 2025, 12:47 PM UTC

The team has successfully stabilized the server situation and resumed operations. We are currently processing incoming data and catching up on the backlog. Please be aware that we are monitoring the situation closely to ensure stable consumption and to address any remaining lag. Investigation into the root cause, a known memory leak in our ingest pods, is ongoing. We appreciate your understanding and patience as we work to fully resolve this issue.

monitoring Jul 29, 2025, 01:35 PM UTC

We are glad to report that the incident has been largely resolved. Our team has managed to successfully stabilize the servers and has resumed operations. We are currently processing incoming data and making good progress in catching up on the backlog. We want to reassure our clients that no events have been lost. Any "event drop" notifications you may have received can be ignored; the events are being processed gradually. We will continue to monitor the situation closely to ensure stable consumption and to completely eliminate any remaining lag. We appreciate your understanding and patience.

resolved Jul 29, 2025, 04:49 PM UTC

We are pleased to announce that the incident has been fully resolved. Our team has successfully stabilized the servers, resumed operations, and cleared the backlog of data. All "event drop" notifications received during this incident can be disregarded as no events were lost; all events have been processed. We appreciate your understanding and cooperation during this time and will continue to monitor the situation to ensure stable operation. Thank you for your patience.
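
Note on duplicated events

The first update warns that pausing operations around offset commits may result in some event duplication. For teams that forward events to their own tooling, a common mitigation is to drop repeats on the receiving side. The following is a minimal sketch of that idea, not Sekoia's implementation: the `event_id` field and the `deduplicate` helper are hypothetical names chosen for illustration, and the sketch assumes every event carries some stable unique identifier.

```python
from typing import Iterable, Iterator


def deduplicate(events: Iterable[dict], id_field: str = "event_id") -> Iterator[dict]:
    """Yield each event once, keyed on a unique ID field.

    `id_field` is a hypothetical field name; substitute whatever stable
    identifier your events actually carry.
    """
    seen: set[str] = set()
    for event in events:
        key = event[id_field]
        if key in seen:
            # Already handled: this is a replayed copy, so skip it.
            continue
        seen.add(key)
        yield event


# Example: a replay after recovery delivers event "a1" a second time.
replayed = [
    {"event_id": "a1", "msg": "login"},
    {"event_id": "b2", "msg": "logout"},
    {"event_id": "a1", "msg": "login"},
]
unique = list(deduplicate(replayed))
```

The in-memory set keeps the sketch short. A long-running consumer would more likely bound it with a time window or keep the seen identifiers in an external store.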