SchemeServe incident

Issue placing new quotes in SchemeServe

Minor · Resolved

SchemeServe experienced a minor incident on November 2, 2023 affecting 🎩 SchemeServe, lasting 2h 29m. The incident has been resolved; the full update timeline is below.

Started: Nov 02, 2023, 10:18 AM UTC
Resolved: Nov 02, 2023, 12:48 PM UTC
Duration: 2h 29m
Detected by Pingoru: Nov 02, 2023, 10:18 AM UTC

Affected components

🎩 SchemeServe

Update timeline

  1. investigating Nov 02, 2023, 10:18 AM UTC

    We are currently investigating an issue with placing new quotes in SchemeServe. We will update as soon as we have more information.

  2. investigating Nov 02, 2023, 10:44 AM UTC

    We have identified the issue as a massively increased load on one of our databases and are currently working to reduce this.

  3. monitoring Nov 02, 2023, 10:55 AM UTC

    We have identified the source of the load on the database and have applied a mitigation to reduce it. This had an immediate effect, reducing load by 60%. We will continue to monitor the changes made.

  4. resolved Nov 02, 2023, 12:48 PM UTC

    The issue is now fully resolved. A post-mortem will be posted in due course.

  5. postmortem Nov 02, 2023, 04:01 PM UTC

    SchemeServe runs several performance monitoring scripts and services on our Primary Database. These aggregate metrics to help identify performance issues, bad query plans and long-running queries. One of these scripts crossed a tipping point at which the resources required to run it jumped significantly, causing a huge load on the database. This in turn caused other queries to run slowly. Some queries within SchemeServe lock tables, and with the increased query time those locks were held for longer, which caused some additional write queries to the locked tables to time out.

    The tracking script is only one of many performance points that we use to monitor the database, so it has been removed to stop further performance issues. SchemeServe aims to maintain an average headroom of at least 50% processing power on our databases during peak operational hours to allow for surges in demand.

    Issue start time: 09:14
    Issue first identified: 09:21
    Issue resolved: 10:38
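
As an illustration of the 50% spare-capacity target mentioned in the post-mortem, the check below is a minimal sketch of how such a target could be evaluated from utilisation samples. It is not SchemeServe's actual monitoring code: the function names, the sample values, and the idea of feeding it CPU utilisation fractions are all assumptions made for the example.

```python
from statistics import mean

# Target taken from the post-mortem: at least 50% spare processing power.
HEADROOM_TARGET = 0.50


def average_headroom(cpu_utilisation_samples):
    """Return average spare capacity (1 - utilisation) over the samples.

    Samples are assumed to be fractions between 0.0 and 1.0.
    """
    if not cpu_utilisation_samples:
        raise ValueError("need at least one utilisation sample")
    return 1.0 - mean(cpu_utilisation_samples)


def meets_headroom_target(cpu_utilisation_samples, target=HEADROOM_TARGET):
    """True if the average spare capacity is at or above the target."""
    return average_headroom(cpu_utilisation_samples) >= target


# Hypothetical peak-hour samples, invented for illustration only.
normal_peak = [0.35, 0.40, 0.45, 0.38]
incident_peak = [0.85, 0.95, 0.90, 0.98]

if __name__ == "__main__":
    print("normal peak meets target:", meets_headroom_target(normal_peak))
    print("incident peak meets target:", meets_headroom_target(incident_peak))
```

A check like this only looks at the average, so a short spike can still exhaust capacity while the average stays within target. A real alert would probably also look at maximum or percentile utilisation.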