Squiz experienced a major incident on April 25, 2024 affecting Squiz SaaS Hosted Instances and Squiz Funnelback Hosted Instances, lasting 4h 40m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Apr 25, 2024, 07:59 AM UTC
Squiz monitoring has detected a degradation of service in our Funnelback Pods. Squiz is working hard to investigate the root cause of the issue and will provide further updates via https://status.squiz.cloud in 15 minutes, or earlier if the situation or information changes.
- investigating Apr 25, 2024, 08:11 AM UTC
We are continuing to investigate this issue.
- investigating Apr 25, 2024, 08:45 AM UTC
We are currently experiencing a degradation impacting Funnelback and continue to investigate the situation. We have declared a major incident and have multiple teams engaged in resolution.
- monitoring Apr 25, 2024, 09:08 AM UTC
Our engineers have identified the likely cause of the incident and have implemented an appropriate fix. We are now testing and monitoring search performance, and have started to see signs of recovery.
- monitoring Apr 25, 2024, 09:20 AM UTC
Our fix resulted in initial performance improvements; however, we are still seeing a performance degradation. Our teams continue to work on a full resolution.
- monitoring Apr 25, 2024, 09:42 AM UTC
Our engineers continue to address the issue and apply fixes where possible to improve service performance.
- monitoring Apr 25, 2024, 10:04 AM UTC
We are continuing to work on an effective fix for this issue.
- monitoring Apr 25, 2024, 10:28 AM UTC
The problem is receiving our full attention as our engineers continue to work on improving service performance through applied fixes.
- monitoring Apr 25, 2024, 10:48 AM UTC
Our engineering team is actively engaged in resolving the issue and applying fixes to boost service performance.
- monitoring Apr 25, 2024, 11:13 AM UTC
The problem is receiving our full attention as our engineers work to improve service performance through applied fixes.
- monitoring Apr 25, 2024, 11:34 AM UTC
We are continuing to work on an effective fix for this issue.
- monitoring Apr 25, 2024, 11:55 AM UTC
Our engineers continue to address the issue and apply fixes where possible to improve service performance.
- monitoring Apr 25, 2024, 12:09 PM UTC
We are now seeing an improvement in performance and are continuing to investigate the root cause of this issue.
- monitoring Apr 25, 2024, 12:18 PM UTC
We are now seeing systems recovering and are continuing to monitor in case of further issues. A post mortem for this incident will also be made available on https://status.squiz.cloud/ in the coming days.
- resolved Apr 25, 2024, 12:39 PM UTC
We are pleased to confirm that the previously reported issue affecting the performance of our Funnelback system has been successfully resolved. Our team closely monitored the situation, and were able to apply a fix for the issue, which led to significant improvements in performance. We will continue to keep a watchful eye on the system to ensure optimal performance and stability. We appreciate your patience and understanding during this time and apologise for any inconvenience caused. A post mortem will be made available on https://status.squiz.cloud/ in the coming days.
- postmortem Apr 26, 2024, 12:30 PM UTC
### Summary

During routine monitoring, Squiz identified operational issues with multiple Funnelback servers, leading to search function disruptions for several customers.

### Customer impact

A subset of UK customers may have experienced delays in search results and encountered 500 errors when attempting to use the search function.

### Issue and Resolution

Squiz engineers were alerted to errors and timeouts originating from our Squiz Funnelback services. Subsequent investigation revealed that the search session functionality within Funnelback was causing slow or erroneous requests, leading to a build-up of requests within the search query processing pipeline. Requests using the search session feature were subject to slow response times or termination. In turn, this impacted performance and resulted in timed-out searches.

In response, we isolated the specific searches and their connection behaviour to our session feature and, where needed, temporarily disabled or paused the use of this feature to allow the query processing pipeline to recover. As a precautionary measure, resource allocation to the query processing pipeline was increased. As part of our standard process, we initiated a period of heightened monitoring, leading to resolution on April 25th at 13:00 BST.

### Mitigation

In light of this incident, Squiz support staff conducted a thorough review of our UK Funnelback systems to preempt future disruptions, including the expansion of memory resources. In addition, measures have been taken to enhance processes, enabling fast-tracked resolution of similar incidents in the future.
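The recovery step described in the postmortem, temporarily pausing the session feature so the query pipeline could drain, resembles the general circuit-breaker pattern. The following is a minimal illustrative sketch of that pattern only; it is not Funnelback's implementation or API. The names `FeatureBreaker`, `search`, and `load_session`, and all thresholds, are hypothetical.

```python
import time
from typing import Callable, Optional


class FeatureBreaker:
    """Temporarily disables an optional feature after repeated slow or failed calls.

    Hypothetical helper for illustration; thresholds are arbitrary assumptions.
    """

    def __init__(self, max_failures: int = 3, slow_after_s: float = 0.5,
                 cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.slow_after_s = slow_after_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.disabled_until: Optional[float] = None

    def enabled(self) -> bool:
        if self.disabled_until is None:
            return True
        if self.clock() >= self.disabled_until:
            # Cooldown elapsed: re-enable the feature and reset the counter.
            self.disabled_until = None
            self.failures = 0
            return True
        return False

    def call(self, feature: Callable[[], dict]) -> Optional[dict]:
        """Run the feature if enabled; return None when it is skipped or raises."""
        if not self.enabled():
            return None
        start = self.clock()
        try:
            result = feature()
        except Exception:
            self._record_failure()
            return None
        if self.clock() - start > self.slow_after_s:
            # A slow success still counts against the feature.
            self._record_failure()
        else:
            self.failures = 0
        return result

    def _record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.disabled_until = self.clock() + self.cooldown_s


def search(query: str, breaker: FeatureBreaker,
           load_session: Callable[[], dict]) -> dict:
    """Core search always runs; session data is attached only when available."""
    session = breaker.call(load_session)
    return {"query": query, "results": [f"result for {query}"], "session": session}
```

In this sketch the core search path never waits on a feature that has been switched off, so a misbehaving optional component degrades results (no session data) rather than stalling every request behind it.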