Locomote incident

Workflow loading issue

Major Resolved

Locomote experienced a major incident on May 20, 2024 affecting Travel Management Platform, lasting 2h 9m. The incident has been resolved; the full update timeline is below.

Started
May 20, 2024, 10:38 PM UTC
Resolved
May 21, 2024, 12:48 AM UTC
Duration
2h 9m
Detected by Pingoru
May 20, 2024, 10:38 PM UTC

Affected components

Travel Management Platform

Update timeline

  1. investigating May 20, 2024, 10:38 PM UTC

    We've received reports of workflow loading failures and are investigating.

  2. identified May 20, 2024, 10:50 PM UTC

    We've identified the cause of workflow loading errors and are rolling out a fix.

  3. monitoring May 20, 2024, 11:03 PM UTC

    Workflows should now be functioning as normal again.

  4. resolved May 21, 2024, 12:48 AM UTC

    This morning engineers identified and resolved an issue with the removal of a downstream third-party logging system within the Management Platform. The Management Platform has been rolled back and has now resumed normal function.

  5. postmortem May 21, 2024, 05:59 AM UTC

    This morning engineers identified and resolved an issue with the removal of a downstream system within the Management Platform. The Management Platform has been rolled back and has now resumed normal function. We apologise for the inconvenience caused by this outage.

    ## Issue detail

    Locomote has extensive test coverage of our entire application, which is the cornerstone of our development confidence. The root cause of this bug slipped past detection in our test suite because three separate issues aligned:

    * The cause of the bug was in code that we were deprecating ahead of future removal
    * The deprecation caused a false positive in a code quality tool, which advised that a parameter was no longer used at its definition site
    * The removal of the parameter did not flag any test failures, because this specific code only runs in production

    The net result was that, because the code quality tool didn’t account for call-site usage, the number of arguments passed at the call site no longer matched the definition whenever this codepath was invoked in production, causing an exception. This exception was unhandled in this specific case, causing a 503 error for users accessing pages that triggered it, most notably workflows.

    The nature of this particular change (a deprecation) didn’t invoke a need for production context testing, as we might otherwise do with specific production-dependent features.

    Our Engineering team made the choice to roll back this particular deployment to resolve the issue as quickly as possible, and to give us time to further investigate and test the issue resolution before any future deployments.

    ## Remediation actions

    * We’re assessing our codebases for any further environment-dependent code to place extra review flags on changes in those contexts
    * We’re re-evaluating our out-of-hours escalation process to ensure on-call escalations happen correctly and promptly

    - Mario, CTO, Locomote
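The failure mode described in the postmortem can be illustrated with a minimal Python sketch. This is not Locomote's code: the function names (`log_event`, `load_workflow`, `handle_request`), the `env` argument, and the choice of language are all assumptions made for illustration. It shows a parameter removed at the definition site while a production-only call site still passes it, so the mismatch only surfaces when the production branch runs.

```python
def log_event(message):
    """Hypothetical deprecated helper, shown after the change.

    A second parameter was removed because a linter reported it as unused
    inside this function body; the linter did not look at call sites.
    """
    return f"logged: {message}"


def load_workflow(workflow_id, env="test"):
    """Hypothetical page handler with an environment-dependent branch."""
    if env == "production":
        # This call site only runs in production and still passes the
        # removed argument, so it raises TypeError at call time.
        log_event("workflow loaded", {"id": workflow_id})
    return {"id": workflow_id, "status": "loaded"}


def handle_request(workflow_id, env="test"):
    """Stand-in for a web layer that turns an uncaught error into a 503."""
    try:
        return 200, load_workflow(workflow_id, env)
    except TypeError as exc:
        return 503, {"error": str(exc)}


if __name__ == "__main__":
    print(handle_request(1, env="test")[0])        # test path never reaches the stale call
    print(handle_request(1, env="production")[0])  # production path hits the mismatch
```

Under these assumptions, a test suite that only exercises `env="test"` passes, while any test that forces the production branch fails immediately. Running environment-dependent branches in tests is one possible way to catch this class of change before deployment.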