Ryft Pay incident

Elevated API errors

Major Resolved

Ryft Pay experienced a major incident on November 24, 2025 affecting the Payments API and lasting 15 minutes. The incident has been resolved; the full update timeline is below.

Started
Nov 24, 2025, 05:40 PM UTC
Resolved
Nov 24, 2025, 05:55 PM UTC
Duration
15m
Detected by Pingoru
Nov 24, 2025, 05:40 PM UTC

Affected components

Payments API

Update timeline

  1. investigating Nov 24, 2025, 05:51 PM UTC

    We are currently investigating this issue.

  2. identified Nov 24, 2025, 05:51 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Nov 24, 2025, 05:51 PM UTC

    The elevated error rates, which lasted approx 20 minutes, have now been resolved. We apologise for any inconvenience caused.

  4. resolved Nov 24, 2025, 05:55 PM UTC

    The incident has now been resolved.

  5. postmortem Nov 25, 2025, 09:30 AM UTC

    **Summary**

    The root cause of the incident was a faulty configuration update during a deployment. This led to a period in which the deployment partially served traffic before being classified as unhealthy. The impacted API resources were as follows:

    * `v1/payment-sessions`

    **Timeline**

    The erroneous deployment went live at 5:21pm UTC. Live traffic was switched to the new instances at 5:23pm. On-site developers noticed elevated errors originating from the new nodes at 5:25pm and initiated a rollback at 5:30pm. The rollback was completed at 5:49pm and immediately reduced the errors introduced by the faulty deployment. The total impact time was approx 25 minutes.

    **What are we doing about it?**

    * Developers have introduced additional measures to detect faulty configuration updates. These steps will prevent bad configuration from being deployable going forward.
    * Improvements to our rollback policies will ensure a more timely rollback in the future.
    * The team will adjust our rolling deployments so that live traffic continues to be served by the existing instances for a longer period before being switched over to the latest deployed instances. This gives a larger window of time in which bad updates can be detected and averted before impacting our customers.
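
The intervals implied by the postmortem timeline can be worked out with a short sketch. The timestamps are taken from the postmortem; the event names, and the choice to measure impact from the traffic switch to rollback completion, are assumptions made for illustration, since the postmortem does not say which endpoints its "approx 25 minutes" figure uses.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Times (UTC) as stated in the postmortem timeline for Nov 24, 2025.
# The dictionary keys are illustrative labels, not vendor terminology.
events = {
    "deployed": datetime.strptime("2025-11-24 17:21", FMT),
    "traffic_switched": datetime.strptime("2025-11-24 17:23", FMT),
    "errors_noticed": datetime.strptime("2025-11-24 17:25", FMT),
    "rollback_started": datetime.strptime("2025-11-24 17:30", FMT),
    "rollback_completed": datetime.strptime("2025-11-24 17:49", FMT),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    return int((events[end] - events[start]).total_seconds() // 60)


# Assumption: impact runs from the traffic switch to rollback completion.
impact = minutes_between("traffic_switched", "rollback_completed")
time_to_detect = minutes_between("traffic_switched", "errors_noticed")
rollback_duration = minutes_between("rollback_started", "rollback_completed")

print(f"impact window:     {impact} min")
print(f"time to detect:    {time_to_detect} min")
print(f"rollback duration: {rollback_duration} min")
```

Under that assumption the impact window is 26 minutes, in line with the stated "approx 25 minutes". Detection took 2 minutes after the switch, while the rollback itself accounts for 19 minutes, which is the interval the rollback-policy improvements target.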