Pay & Connect incident

Increased DB CPU usage

Critical Resolved

Pay & Connect experienced a critical incident on August 25, 2022. The incident has been resolved; the full update timeline is below.

Started
Aug 25, 2022, 03:30 PM UTC
Resolved
Aug 25, 2022, 03:30 PM UTC
Duration
Detected by Pingoru
Aug 25, 2022, 03:30 PM UTC

Update timeline

  1. resolved Aug 26, 2022, 04:05 PM UTC

    An increase in DB CPU usage was detected, which increased latency throughout the system.

  2. postmortem Aug 26, 2022, 04:05 PM UTC

    To relieve the strain on the DB server while the source of the issue was investigated, the DB was allocated additional CPU. The added resources temporarily brought CPU usage back down to acceptable levels, but this morning (26/8) they were again stretched to their limits: the server first showed increased latency and eventually stopped serving requests. At that point we increased the DB CPU allocation further to relieve the load immediately, and stepped up efforts to establish the root cause. We isolated a particularly slow, long-running query whose performance had degraded as the transaction table grew. We implemented a substantial optimisation of the query and deployed an update soon after. Since the optimisation, server performance metrics have improved markedly.
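
The postmortem does not say which query was slow or how it was optimised. As a minimal illustration of how a query can degrade as a table grows, the sketch below uses Python's built-in sqlite3 module with a hypothetical `transactions` table, hypothetical column names, and an added index as one common kind of fix. All of these are assumptions for illustration, not the actual schema, database engine, or change deployed.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the plan detail


# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, account_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions (account_id, amount) VALUES (?, ?)",
    [(i % 1000, i) for i in range(50000)],
)

sql = "SELECT SUM(amount) FROM transactions WHERE account_id = ?"

# Without an index, the filter reads every row, so cost grows with table size.
plan_before = query_plan(conn, sql, (42,))

# Assumed fix: an index on the filtered column turns the scan into a lookup.
conn.execute(
    "CREATE INDEX idx_transactions_account ON transactions (account_id)"
)
plan_after = query_plan(conn, sql, (42,))

result = conn.execute(sql, (42,)).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two plans shows the full-table scan being replaced by an index search. On a production database the equivalent check would use that engine's own plan inspection tooling.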