Safe FME incident
Incorrect alerts are being issued for FME Cloud instances
Safe FME experienced a minor incident on March 2, 2023 affecting FME Flow Hosted Dashboard/API and FME Flow Hosted Instances, lasting 3d 22h. The incident has been resolved; the full update timeline is below.
Affected components
- FME Flow Hosted Dashboard/API
- FME Flow Hosted Instances
Update timeline
- identified Mar 02, 2023, 06:18 PM UTC
FME Cloud is experiencing issues with the service it uses for alerting. Disk and memory alerts are being triggered incorrectly. Please disregard these alerts. We are currently waiting for assistance from our metrics and alerting service provider.
- identified Mar 03, 2023, 01:34 AM UTC
Currently there are still issues with FME Cloud metrics and alerting. Our service provider is looking into the issue but at this time we do not have an ETA for when things will be fixed.
- monitoring Mar 03, 2023, 11:27 PM UTC
Monitoring and alerts appear to have recovered. We are still unsure of the root cause and are waiting for an explanation from our service provider before resolving this incident.
- resolved Mar 06, 2023, 05:11 PM UTC
This incident has been resolved. From our service provider: "The summary of the problem is that one of the internal services responsible for the metrics portion of the product was occasionally failing to keep up with realtime traffic, and as a result a subset of metrics were being impacted. This should be fixed now."
- postmortem Mar 08, 2023, 09:31 PM UTC
We’d like to apologize to all FME Cloud customers who were affected by incorrect alerts for their instances. On March 2nd, 2023, Safe Software noticed that incorrect alerts were being triggered for disk space and memory usage on FME Cloud instances. Internal investigation showed that the composite data for disk space was missing and the memory data was wrong. We reached out to our metrics and alerting service provider (Librato) and updated the Safe Software status page to show degraded performance for FME Cloud Dashboards. Overnight the service appeared to recover, so the status was moved to monitoring while we waited for a response or confirmation from Librato. On March 8th Librato reported that one of their internal services responsible for metrics was occasionally failing to keep up with realtime traffic, and as a result a subset of metrics was being impacted. This is now fixed. Safe Software has not seen any incorrect alerts since March 3rd. Given this and the response from Librato, we are confident the issue has been resolved. The status of FME Cloud Dashboards has returned to operational and the incident is resolved on [status.safe.com](http://status.safe.com).