Tilroy incident

Reports of very slow Tilroy performance

Major · Resolved

Tilroy experienced a major incident on May 18, 2026 affecting Point Of Sale and Data management and reporting, lasting 1h 4m. The incident has been resolved; the full update timeline is below.

Started
May 18, 2026, 11:16 AM UTC
Resolved
May 18, 2026, 12:21 PM UTC
Duration
1h 4m
Detected by Pingoru
May 18, 2026, 11:16 AM UTC

Affected components

Point Of Sale, Data management and reporting

Update timeline

  1. investigating May 18, 2026, 11:16 AM UTC

    We have been notified of slow Tilroy performance. This is under investigation.

  2. investigating May 18, 2026, 11:18 AM UTC

    Tilroy is not available at the moment.

  3. investigating May 18, 2026, 11:44 AM UTC

    A problem was identified with one of the databases. The system is online again.

  4. monitoring May 18, 2026, 11:44 AM UTC

    A fix has been implemented and we are monitoring the results.

  5. monitoring May 18, 2026, 12:06 PM UTC

    We are continuing to monitor for any further issues.

  6. resolved May 18, 2026, 12:21 PM UTC

    This incident has been resolved.

  7. postmortem May 19, 2026, 05:42 AM UTC

    **Summary** On Monday 18 May 2026 around 13:00 (local time), our production database became unavailable for 26 minutes. The cluster recovered automatically and was fully back in service by 13:43. No data was lost.

    **Background** Our database runs as a three-node cluster: three identical copies of our data, kept in sync across three separate servers. One node is the primary: it accepts new data. The other two are secondaries: they continuously copy the primary. At least two of the three nodes must be online for the cluster to keep accepting data. If only one remains, the cluster intentionally stops, to prevent the servers from disagreeing on what the latest data is.

    **What happened** At around 13:05, the primary node started to slow down and stopped responding to the other two. Within seconds, the cluster automatically promoted one of the secondaries to be the new primary, a routine failover that should have been invisible to users. This time the failover did not complete cleanly. The new primary inherited a large backlog of unfinished work from the old primary and became overloaded while trying to clear it. At 13:14, the new primary hit a built-in safety mechanism: when a node is stuck and cannot complete its role change within 30 seconds, it shuts itself down so the rest of the cluster can move on. About 30 seconds later, at 13:15, the original primary, still trying to step down, hit the same safety mechanism and also shut itself down. With two of the three nodes down at the same time, the cluster fell below the two-node minimum and stopped accepting reads and writes.

    **Recovery** Both failed nodes were restarted automatically. They came back online around 13:40, a new primary was elected, and full service was restored at 13:43, with no manual intervention required.

    **Next steps** We are working with our database supplier to identify the underlying root cause, which could be hardware related. A follow-up will be shared once those findings are confirmed.
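
The two-of-three rule described in the postmortem is a majority quorum. The sketch below is only an illustration of that rule, not the vendor's database logic: the names `has_quorum`, `quorum_states`, and `timeline` are hypothetical, and the node counts are a simplified reading of the postmortem timeline (a node counts as online until it shuts itself down).

```python
def has_quorum(online_nodes: int, total_nodes: int = 3) -> bool:
    """Return True when a strict majority of the cluster's nodes is online."""
    return online_nodes > total_nodes // 2


# Simplified replay of the postmortem timeline (local time, online node count).
# Assumption: a slow or overloaded node still counts as online until it shuts down.
timeline = [
    ("13:05", 3),  # primary slows down, failover begins
    ("13:14", 2),  # new primary shuts itself down
    ("13:15", 1),  # original primary shuts itself down
    ("13:40", 3),  # both failed nodes are back online
]


def quorum_states(events):
    """Label each timeline entry by whether the cluster holds a majority."""
    return [
        (time, "quorum" if has_quorum(count) else "no quorum")
        for time, count in events
    ]


for time, state in quorum_states(timeline):
    print(time, state)
```

With three nodes, losing one still leaves a majority, which is why a single failover is normally survivable; the outage began only once the second node went down.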