Eleos Technologies incident

Reduction in Errors Logged to Error console

Major Resolved View vendor source →

Eleos Technologies experienced a major incident on May 2, 2024 affecting API and App Manager and 1 more component, lasting 2h 5m. The incident has been resolved; the full update timeline is below.

Started
May 02, 2024, 08:31 PM UTC
Resolved
May 02, 2024, 10:36 PM UTC
Duration
2h 5m
Detected by Pingoru
May 02, 2024, 08:31 PM UTC

Affected components

APIApp ManagerMobile AppsDocument Hub and Drive AxleDocument Delivery

Update timeline

  1. investigating May 02, 2024, 08:31 PM UTC

    We are reducing the number of errors being logged to error console to 50% of their normal volumes during this time period to mitigate potential performance issues.

  2. investigating May 02, 2024, 08:36 PM UTC

    We are seeing the same issues as earlier for Drive Axle, Document Hub, App Manager, and Platform document processing.

  3. investigating May 02, 2024, 08:38 PM UTC

    We are currently not logging errors to error console at this time.

  4. investigating May 02, 2024, 08:48 PM UTC

    We are still investigating the cause of this issue. Drive Axle, the Document Hub, and App Manager login are not working.

  5. monitoring May 02, 2024, 09:10 PM UTC

    We are continuing to investigate this issue. At this time, it appears that error rates are approaching normal, however we are continuing to monitor the situation. We believe that this issue is related to difficulties involving a third party partner, and we're working to mitigate the effect these issues are having on Drive Axle, the Document Hub, and App Platform.

  6. monitoring May 02, 2024, 09:46 PM UTC

    Our third party partner has indicated they have applied mitigations on their end, and our error rates are are normal. We are continuing to monitor the situation. Error console is now logging at it's normal rate.

  7. monitoring May 02, 2024, 10:29 PM UTC

    We are still currently monitoring at this time and error rates are normal. During 18:15 UTC to 20:45 UTC, a small percentage of workflow actions that used telematics data and a percentage of loads requests for users with the `manage_shipments` flag set on their authentication requests failed. Intermittently failed requests during this time period would have been retried.

  8. resolved May 02, 2024, 10:36 PM UTC

    We are marking this incident as resolved for now, but we're continuing to monitor system performance.

  9. postmortem May 03, 2024, 09:06 PM UTC

    We have identified the root cause of the Eleos Platform outage that occurred on May 2, 2024 and have prepared an emergency mitigation that can be applied in case of another occurrence. This outage resulted from unexpected system behavior during a partner’s separate outage. The partner has indicated a full resolution as of May 2, 2024 at 22:00 UTC, so at this time, we do not expect an imminent recurrence. However, we’re working to implement a complete fix to the root cause of the unexpected behavior. We’ll share a more detailed post-mortem in the coming week.