Sekoia FRA1 Outage History
Sekoia FRA1 is up right now.
Sekoia FRA1 had 52 outages in the last 2 years totaling 291h 44m of downtime, averaging 2.1 incidents per month.
There have been 52 Sekoia FRA1 outages since June 17, 2025, totaling 291h 44m of downtime. Each is summarised below with incident details, duration, and resolution information.
Delay in alert processing
Timeline · 4 updates
- investigating Feb 04, 2026, 08:14 PM UTC
We detected an outage in our alert processing service. We're currently investigating the issue. We appreciate your patience and will provide updates as soon as we have more information.
- identified Feb 04, 2026, 08:15 PM UTC
We identified the issue and fixed it, and are currently catching up on the backlog of delayed alerts.
- monitoring Feb 04, 2026, 08:40 PM UTC
We're currently catching up on the alert processing delay and closely monitoring the situation. We appreciate your patience and will provide updates as soon as we have more information.
- resolved Feb 04, 2026, 09:22 PM UTC
Alert processing is fully operational, and all queued alerts have been processed. This incident is resolved, and we apologize for any inconvenience this may have caused.
Event search slowness
Timeline · 4 updates
- investigating Jan 05, 2026, 10:20 AM UTC
We are investigating reports of slow event search.
- identified Jan 05, 2026, 02:54 PM UTC
We have identified the issue, which was linked to a specific usage pattern.
- monitoring Jan 05, 2026, 02:54 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Jan 06, 2026, 12:57 PM UTC
No slowdowns have been observed since Jan 05, 13:30 CET. This incident is now resolved.
Alert retrieval delays due to cache failover issue
Timeline · 3 updates
- investigating Dec 26, 2025, 10:55 AM UTC
Starting from 7:30, alert retrievals have been delayed due to a failover issue in the sightings cache cluster causing data read inconsistencies. The primary sightings cache instance was manually stopped to force the failover to the secondary instance. Investigation and recovery efforts are ongoing. Customers may experience delays in alert updates during this time.
- identified Dec 26, 2025, 11:13 AM UTC
Following the manual failover of the sightings cache cluster primary instance, alert retrieval processing has resumed and backlog processing is underway. The fix has restored normal operation and alerts are being consumed again. Monitoring is ongoing to ensure full recovery.
- resolved Dec 26, 2025, 02:42 PM UTC
Alert processing is fully operational, and all queued alerts have been processed. Our investigation found no alert loss. This incident is resolved, and we apologize for any inconvenience this may have caused.
Platform slowness
Timeline · 2 updates
- identified Dec 02, 2025, 04:33 PM UTC
We are currently experiencing performance degradation on the platform due to high input/output wait on a relational database. Our engineering team has identified the root cause and is implementing a fix. We do not have an ETA for full resolution of the incident yet, but we expect it to be resolved this evening. We apologize for the inconvenience.
- resolved Dec 02, 2025, 08:53 PM UTC
Stability was restored around 18:00 by multiple actions from our team, relieving the database load. The performance issues affecting the platform have been resolved as of 20:45. We are implementing long-term improvements to prevent similar incidents and ensure consistent service quality. We apologize for the disruption and appreciate your understanding.
Events unavailability
Timeline · 2 updates
- identified Nov 07, 2025, 04:13 PM UTC
We are aware of a temporary unavailability of events related to alerts. Our teams are already working on the issue and have applied a fix; events will be available again shortly. We apologize for the inconvenience.
- resolved Nov 07, 2025, 04:44 PM UTC
Events are available again. Thank you for your patience.
Indexing delay
Timeline · 2 updates
- monitoring Oct 06, 2025, 02:56 PM UTC
A failure in some indexing servers, caused by an incident on our cloud provider's side, led to a temporary stop of indexing. The issue was quickly fixed by our team, but it generated some lag in event indexing. We are now monitoring the state of the service and will come back to you once we are back to real-time. There was no data loss, and event processing and alert raising are still happening on time. We are sorry for the inconvenience.
- resolved Oct 06, 2025, 05:01 PM UTC
We are back to real-time indexing. Thank you for your patience.
Platform-wide degraded performance
Timeline · 4 updates
- investigating Oct 03, 2025, 11:50 PM UTC
We are experiencing platform-wide degraded performance. An important relational database host is saturated, causing platform APIs and services to be throttled and to exhibit elevated response times or timeouts. Event collection is not impacted. Rest assured that no data is lost, but access to the platform is severely degraded. Engineers are investigating the root cause and evaluating mitigations. We will come back to you as soon as we have new information. Sorry for the inconvenience.
- identified Oct 04, 2025, 01:00 AM UTC
We have found the root cause and applied a fix. This required restarting a service in our event processing pipeline, which shifted the problem to ingestion. This means you can now access the web app again, but we are now experiencing a slight delay in event processing and alert raising. We are slowly scaling our service up again and should catch up on the delay rapidly. We will keep you updated once everything is back to normal. Thank you for your patience.
- monitoring Oct 04, 2025, 01:52 AM UTC
The platform is now back to an operational state and we are working through the accumulated delay. Our team is still working out long-term solutions and preparing a fix. We will come back to you once ingestion is back to real-time. Sorry for the inconvenience and thanks for your patience.
- resolved Oct 04, 2025, 02:21 AM UTC
This incident has been resolved.
FRA1 automation features incident
Timeline · 3 updates
- identified Sep 30, 2025, 10:36 PM UTC
We are experiencing an issue with the storage cluster behind the automation feature. We have temporarily suspended the automation features to prevent further errors. Our engineers are actively working on the issue. We appreciate your patience and will provide updates as soon as we have more information.
- monitoring Sep 30, 2025, 11:34 PM UTC
We installed a new server to allow the cluster to repair itself. We restarted the automation features and we are not seeing any errors.
- resolved Sep 30, 2025, 11:45 PM UTC
This incident has been resolved.
FRA1 parsing paused
Timeline · 3 updates
- identified Sep 26, 2025, 03:40 PM UTC
Nodes responsible for event parsing encountered an issue due to memory starvation. Our team is working to address the issue by restarting the system and forcing reboots. We appreciate your patience while we work towards a resolution.
- monitoring Sep 26, 2025, 04:02 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Sep 27, 2025, 01:26 AM UTC
This incident has been resolved.
Alert bulk actions are stuck
Timeline · 2 updates
- monitoring Sep 26, 2025, 12:41 PM UTC
There was a performance issue with bulk actions on alerts in the UI. No bulk operations were lost, but they were slowed down. There is currently no further impact, as a quick fix was implemented, but we are monitoring the situation and making sure all bulk actions are successfully executed.
- resolved Sep 26, 2025, 04:20 PM UTC
Everything is running fine; this incident is resolved.
Storage cluster disruption
Timeline · 4 updates
- identified Sep 11, 2025, 04:29 PM UTC
We are unfortunately experiencing the same issue that occurred earlier this week on our storage cluster. Our team is actively working on stabilizing the storage cluster and restarting the service. The procedure will be the same as last time. Indexing is now stopped. However, alerts are still raised on time, and no data loss is expected during the incident. We will provide regular updates as the situation progresses.
- identified Sep 11, 2025, 05:24 PM UTC
Our team is actively working on resolving the incident. We are applying corrective measures to stabilize the system and restore full service as quickly as possible. In parallel, we are also investigating the complex root cause of this incident. Real-time data ingestion is being maintained. We will continue to keep you updated on our progress.
- monitoring Sep 11, 2025, 05:45 PM UTC
Indexing was restarted at 19:20 CEST and is stable, and we are now processing the backlog of events accumulated during the downtime. We will share an update as soon as real-time processing is fully restored.
- resolved Sep 11, 2025, 06:34 PM UTC
This incident has been resolved and corrective actions were put in place, as a workaround to our provider's DHCP issues. Our team will be conducting an investigation with OVH teams very shortly.
FRA1 delay on event storage
Timeline · 4 updates
- identified Sep 08, 2025, 10:01 AM UTC
We're currently experiencing performance issues on the event storage cluster. This is leading to delays in event storage. As a result, events may become available later than usual on both the alerts and events pages. Our team is actively working on mitigating the situation. We appreciate your patience and understanding as we work to resolve this issue.
- identified Sep 08, 2025, 04:12 PM UTC
We are continuing to work on a fix for this issue.
- monitoring Sep 08, 2025, 06:49 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Sep 08, 2025, 07:47 PM UTC
We have made significant progress in addressing the performance issues. As a result, the events are stored in real-time after the whole detection workflow. This incident is resolved. We appreciate your patience and understanding throughout this process.
FRA1 web UI is unavailable
Timeline · 5 updates
- investigating Sep 04, 2025, 06:19 PM UTC
We are currently experiencing a major incident impacting our service in the FRA1 region. Our cybersecurity platform is currently unreachable from the outside, although our internal systems and automated workflows remain operational. Initial checks show that all system performance indicators are normal. Our engineering team is investigating the situation and working to restore full service as soon as possible. We appreciate your patience and apologize for any inconvenience caused.
- monitoring Sep 04, 2025, 06:32 PM UTC
We are seeing improvements following the previously reported major incident in the FRA1 region. Access to our cybersecurity platform has been restored for many users. We are keeping a close eye on the situation and will provide further updates as necessary. We appreciate your understanding and cooperation during this time.
- identified Sep 04, 2025, 06:58 PM UTC
The situation is currently unstable due to our cloud provider. We are still trying to find a solution.
- monitoring Sep 04, 2025, 08:01 PM UTC
The situation has been fixed. We set up a workaround to provide access to app.sekoia.io. We are still monitoring the situation and assessing the potential impact of the workaround.
- resolved Sep 05, 2025, 10:13 AM UTC
This incident has been resolved.
FRA1 automation features cluster nodes unavailability
Timeline · 3 updates
- identified Aug 17, 2025, 06:10 PM UTC
We are currently experiencing issues with the automation features cluster in the FRA1 region, with some nodes being unavailable due to a provider incident. The team is actively working to resolve the problem. We apologize for any inconvenience and will provide updates as more information becomes available.
- monitoring Aug 17, 2025, 06:40 PM UTC
The situation with the automation features cluster in the FRA1 region has improved, with new nodes being successfully installed and handling the previously pending tasks. We are gradually returning the systems to normal operations and closely monitoring the situation. We appreciate your patience and will provide further updates as necessary.
- resolved Aug 17, 2025, 07:13 PM UTC
The issues with the automation features cluster in the FRA1 region have been successfully resolved. All systems have returned to normal operations and the status of the system is stable.
Some correlation alerts raised with a delay
Timeline · 2 updates
- identified Aug 01, 2025, 03:40 PM UTC
We have identified an issue with the performance of our correlation engine. This has resulted in a delay in processing tasks for some correlation rules. Those tasks are buffered, and we appreciate your patience as we work to resolve this issue. As of now, all correlation rules are being processed in real time.
- resolved Aug 05, 2025, 09:22 AM UTC
The correlation engine has been functioning in real time without any issues since the end of our operations on Friday. Please note that our engineering team is working on performance improvements to further enhance our system and avoid similar issues in the future.
FRA1 server instability causing workflow slowdowns
Timeline · 4 updates
- identified Jul 29, 2025, 12:08 PM UTC
We are currently experiencing an issue with a number of servers that are not operational. This is causing a slowdown in our workflows. Our team is actively working on stabilizing the affected servers and managing the high memory usage observed on a given tier of nodes. In the process, we have temporarily paused certain operations to allow for system recovery and offset commitments. Please note that this may result in some event duplication. We will keep you updated as we progress in resolving the issue. Thank you for your patience.
- monitoring Jul 29, 2025, 12:47 PM UTC
The team has successfully stabilized the server situation and resumed operations. We are currently processing incoming data and catching up on the backlog. Please be aware that we are monitoring the situation closely to ensure stable consumption and to address any remaining lag. Investigation into the root cause, a known memory leak in our ingest pods, is ongoing. We appreciate your understanding and patience as we work to fully resolve this issue.
- monitoring Jul 29, 2025, 01:35 PM UTC
We are glad to report that the incident has been largely resolved. Our team has managed to successfully stabilize the servers and has resumed operations. We are currently processing incoming data and making good progress in catching up on the backlog. We want to reassure our clients that no events have been lost. Any "event drop" notifications you may have received can be ignored; the events are being processed gradually. We will continue to monitor the situation closely to ensure stable consumption and to completely eliminate any remaining lag. We appreciate your understanding and patience.
- resolved Jul 29, 2025, 04:49 PM UTC
We are pleased to announce that the incident has been fully resolved. Our team has successfully stabilized the servers, resumed operations, and cleared the backlog of data. All "event drop" notifications received during this incident can be disregarded as no events were lost; all events have been processed. We appreciate your understanding and cooperation during this time and will continue to monitor the situation to ensure stable operation. Thank you for your patience.
FRA1 significant drop in traffic due to potential load balancer issue
Timeline · 4 updates
- investigating Jul 24, 2025, 09:29 AM UTC
We are currently experiencing a significant drop in traffic. Initial suspicions point towards an issue with the upstream OVH load balancer. Our team is actively investigating and contacting support for further assistance. The user interface and API are not affected by this incident as the load balancers for these services are separated. We will provide further updates as the situation develops.
- identified Jul 24, 2025, 09:34 AM UTC
The traffic issue in the FRA1 region is beginning to resolve. However, due to the buffered traffic on the client side, some lag may be experienced. Our team continues to monitor the situation closely, and we will provide further updates as soon as possible. We appreciate your patience.
- monitoring Jul 24, 2025, 09:55 AM UTC
The traffic issue experienced in the FRA1 region is gradually stabilizing. As the buffered client-side traffic is processed, some lag may still be experienced, particularly with ingestion and workflow. However, the platform is handling the traffic spike effectively. Our team continues to monitor the situation closely, and we will provide further updates as necessary. Thank you for your patience.
- resolved Jul 24, 2025, 02:40 PM UTC
The traffic issue experienced in the FRA1 region has now stabilized. The buffered client-side traffic has been processed and the platform is back to operating normally.
FRA1 alert generation delays
Timeline · 3 updates
- investigating Jul 22, 2025, 08:30 AM UTC
We are currently experiencing a delay in alert generation in the FRA1 region. This is due to an increased volume of alerts since 9:50. Our team is actively working on mitigating the issue. We apologize for any inconvenience caused and appreciate your patience as we resolve this.
- monitoring Jul 22, 2025, 08:50 AM UTC
Our team has reduced the pressure on our systems and the delay in alert generation appears to be decreasing following our actions. We continue to monitor the situation closely and will provide further updates as necessary. Thank you for your understanding and patience.
- resolved Jul 22, 2025, 11:35 AM UTC
This incident has been resolved since 12:10 CEST.
Event indexation delays
Timeline · 4 updates
- investigating Jul 18, 2025, 01:44 PM UTC
We are currently investigating an issue causing event indexation delays in the FRA1 region. Our engineering team is working to resolve the issue and reduce the backlog as quickly as possible. We appreciate your patience and will provide updates as soon as we have more information.
- identified Jul 18, 2025, 01:47 PM UTC
We identified the issue and applied a fix. Event ingestion is now working as expected, and we are catching up on a small ingestion lag. We appreciate your patience and will provide updates as soon as we have more information.
- monitoring Jul 18, 2025, 01:56 PM UTC
Our engineering team has identified the cause of the event indexation delay in the FRA1 region and implemented a fix. The backlog of events is still processing. We will continue to monitor the situation closely and provide updates as necessary.
- resolved Jul 18, 2025, 04:23 PM UTC
This incident has been resolved.
Delay on alerts processing
Timeline · 3 updates
- identified Jul 07, 2025, 04:11 PM UTC
We are currently experiencing a delay in alert processing. Some alerts may intermittently not display their associated events. Our team has identified the root cause and is actively working on resolving the issue. Please note that this does not affect the detection of security incidents but may delay the visibility of events associated with alerts. We appreciate your patience and understanding.
- monitoring Jul 07, 2025, 04:23 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Jul 07, 2025, 04:49 PM UTC
All systems are now operating normally, and there is no longer any lag in the processing of alerts or events associated with alerts. We appreciate your patience and understanding during this time.
Delay in executing playbook actions
Timeline · 4 updates
- investigating Jul 04, 2025, 09:50 AM UTC
We are currently investigating an issue that is generating a large number of new tasks within our automation system, causing a queue to fill. Our team is actively working on this to diagnose the problem. Further updates will be provided as more information becomes available.
- identified Jul 04, 2025, 10:07 AM UTC
We have identified an issue related to an unusually high number of tasks generated, causing our task management system to overload. Our team continues to work on resolving this incident, and we will provide further updates as the situation evolves.
- monitoring Jul 04, 2025, 11:19 AM UTC
Our team has made significant progress in resolving the lag issue with playbook execution. After identifying the root cause, we implemented measures to handle new tasks in real-time. We're now executing tasks in real-time, so everything is operational. We are now closely monitoring the situation to ensure all tasks are being processed as expected. We will continue to provide updates as more information becomes available.
- resolved Jul 04, 2025, 01:16 PM UTC
All tasks are now being processed in real-time with no pending tasks in the queue. This incident is now resolved. We appreciate your patience and understanding during this incident.
Event storage slowness
Timeline · 3 updates
- identified Jul 01, 2025, 10:09 AM UTC
We are currently experiencing delays in event indexation in our FRA1 region due to an unexpected spike in request volumes. Our team is actively working to mitigate the issue. You can expect some delay in event indexation. We appreciate your patience and understanding.
- monitoring Jul 01, 2025, 10:43 AM UTC
The request volume on our FRA1 event storage cluster has stabilized, and event indexation is back to normal. You may still experience a short delay when viewing events while we catch up on the backlog. We are also planning a solution to fix the underlying issue. We appreciate your patience as we work to resolve this.
- resolved Jul 01, 2025, 12:50 PM UTC
This incident has been resolved.
Playbook module issue
Timeline · 2 updates
- identified Jun 30, 2025, 06:49 PM UTC
We detected an issue with the HTTP module in playbooks, starting at 18:30 CEST, that prevented runs of this module from being processed. This has caused timeouts on some playbook runs using this module. We applied a fix, and new jobs have not been impacted since 20:24 CEST. We are now re-running the jobs that timed out.
- resolved Jun 30, 2025, 08:53 PM UTC
The issue with playbooks using the HTTP module is now fully resolved and the automation processes are operational. New jobs created since 20:24 CEST have been running without issue and we have re-run all jobs that failed during the incident. We appreciate your understanding throughout this incident and apologize for any inconvenience caused.
Degraded performance on intake endpoints
Timeline · 4 updates
- investigating Jun 25, 2025, 07:27 AM UTC
We are currently experiencing issues with our data intake service. Initial investigations suggest a possible problem with our public reverse proxy or the external load balancer service. We are actively working on the issue and have reached out to our external service provider for assistance. Please note that this could affect data ingestion. We apologize for any inconvenience caused and we appreciate your patience as we work to resolve this incident.
- identified Jun 25, 2025, 07:33 AM UTC
We have identified the issue affecting our data intake service. We're currently fixing the issue.
- monitoring Jun 25, 2025, 08:27 AM UTC
The issue with our data intake service has been identified and resolved. The problem was due to a network issue between our reverse proxy servers and our external service provider load balancer. This caused an inability to ingest events sent via HTTP and SYSLOG. We are currently monitoring the situation closely to ensure stability. We appreciate your patience and understanding during this incident.
- resolved Jun 25, 2025, 05:48 PM UTC
This incident has been resolved.