SchemeServe incident

Intermittent service issues

Minor · Resolved

SchemeServe experienced a minor incident on January 8, 2026 affecting the 🎩 SchemeServe component and lasting 20 hours. The incident has been resolved; the full update timeline is below.

Started
Jan 08, 2026, 03:26 PM UTC
Resolved
Jan 09, 2026, 11:26 AM UTC
Duration
20h
Detected by Pingoru
Jan 08, 2026, 03:26 PM UTC

Affected components

🎩 SchemeServe

Update timeline

  1. investigating Jan 08, 2026, 03:26 PM UTC

    SchemeServe is currently experiencing intermittent service issues. This is being investigated at the highest priority.

  2. monitoring Jan 08, 2026, 05:01 PM UTC

    We are continuing to monitor this.

  3. resolved Jan 09, 2026, 11:26 AM UTC

    This incident has been resolved.

  4. postmortem Jan 13, 2026, 12:40 PM UTC

### Summary

On Thursday 8th January 2026, between 14:53 and 17:33, SchemeServe experienced an incident that resulted in intermittent request timeouts for some customers. This occurred when customer requests relied on additional internal service-to-service calls, which were delayed due to unexpected back-pressure within our centralised logging infrastructure. The issue did not affect all customers or all requests, and no data was lost. However, we recognise that any request timeout impacts customer trust, and we sincerely apologise for the disruption.

### Customer Impact

During the incident window:

* Some customer requests timed out when they depended on an additional internal service call
* The impact was intermittent and limited to a subset of workflows
* No customer data was lost or corrupted
* Platform availability remained intact, but responsiveness was degraded for affected requests
* At its peak, the issue affected less than 1% of calls made to SchemeServe

We understand the importance of reliability and are committed to ensuring this does not happen again.

### What Happened

SchemeServe uses a centralised logging service to collect logs from all internal services. To optimise performance, logs are batched in in-memory queues before being written to the database.

On the day of the incident, there was a significant increase in both the volume and size of log messages. Under this load, the logging client unintentionally created a new HTTP client per request, leading to a much higher number of concurrent connections than expected.

As traffic increased, several compounding factors interacted (a simplified sketch of the queue behaviour follows the Detection and Mitigation section below):

1. In-memory log queues filled faster than they could be drained, increasing memory pressure.
2. Database writes slowed due to connection pool contention, triggering retry logic.
3. Retry delays were longer than intended, causing queue drainage to slow further.
4. Incoming log requests waited for space in the in-memory queue instead of failing fast, holding HTTP connections open.
5. Over time, this exhausted available HTTP sockets.
6. Once sockets were exhausted, other internal service-to-service calls were forced to wait or time out.
7. Customer requests that depended on these internal calls also timed out.

In parallel, logs that could not be written immediately were written to storage for later ingestion. This prevented data loss, but when the logging service attempted to ingest these logs in the background, it added additional load at a time when the system was already under pressure.

Individually, each of these mechanisms exists to improve resilience. In this case, the combination of increased load, large payload sizes, retry timing, connection handling, and scaling limits created a cascading issue.

### Detection and Mitigation

Once the issue was identified:

* Log ingestion through the logging service was temporarily disabled
* Services automatically fell back to writing logs to storage
* This reduced pressure on HTTP connections and allowed internal services to recover
* Once traffic returned to normal levels, all stored logs were successfully processed

Service performance returned to normal shortly after mitigation was applied.
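To make the queue interaction above easier to follow, here is a minimal Python sketch of the two enqueue behaviours described: the blocking behaviour that held connections open during the incident, and the fail-fast behaviour that diverts to storage. All names (LOG_QUEUE, enqueue_log_fail_fast, write_to_fallback_storage, and so on) are hypothetical and illustrative only; they are not taken from SchemeServe's codebase.

```python
import queue
import time

# Hypothetical sketch only: a bounded in-memory queue batches log entries
# before a background writer flushes them to the database.

LOG_QUEUE = queue.Queue(maxsize=10_000)

def write_to_fallback_storage(entry):
    """Stand-in for the 'write to storage for later ingestion' fallback path."""
    pass

def enqueue_log_blocking(entry):
    # Incident behaviour: wait for space in the queue. While the writer is
    # stalled on slow database retries, every caller (and the HTTP connection
    # carrying its request) is held open here.
    LOG_QUEUE.put(entry)

def enqueue_log_fail_fast(entry):
    # Mitigated behaviour: give up almost immediately and divert the entry to
    # durable storage for later ingestion, releasing the connection.
    try:
        LOG_QUEUE.put(entry, timeout=0.05)
    except queue.Full:
        write_to_fallback_storage(entry)

def background_writer(write_batch, retry_backoff=0.5):
    # Drain the queue in batches. Long retry backoffs here were one of the
    # compounding factors: the queue filled faster than it drained.
    while True:
        batch = [LOG_QUEUE.get()]
        while len(batch) < 500:
            try:
                batch.append(LOG_QUEUE.get_nowait())
            except queue.Empty:
                break
        try:
            write_batch(batch)            # e.g. a bulk insert into the database
        except Exception:
            time.sleep(retry_backoff)     # keep this short and bounded
            for item in batch:
                write_to_fallback_storage(item)
```

The key difference is that the fail-fast variant releases the caller, and the connection carrying its request, within milliseconds, at the cost of routing the entry through the slower storage fallback for later ingestion.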
### Root Cause

The root cause was unexpected back-pressure in the logging pipeline caused by a combination of:

* Increased log volume and size
* Inefficient HTTP client reuse under this usage pattern
* Slow retry backoff when database connections were exhausted
* In-memory queue behaviour that held connections open rather than failing fast
* Limited horizontal scaling under memory pressure

These factors interacted in a way that was not observed during prior testing.

### Why This Was Not Detected Earlier

Although extensive load testing was performed, this scenario did not surface because it required a specific combination of conditions:

* High request throughput
* Large individual log payloads
* Slower-than-expected retry delays
* High concurrency of short-lived HTTP clients
* Memory pressure preventing timely scaling
* Previous similar occurrences in the production environment had been managed successfully by the designed fallback systems

Each condition was tested independently, but the combined effect was not fully represented in test scenarios.

### Actions We Are Taking to Prevent Recurrence

We have implemented the following improvements:

* Reduced database retry backoff times to prevent prolonged queue blocking
* Increased memory allocation for in-memory log queues
* Introduced dedicated HTTP clients for log writes with short, bounded timeouts (a rough sketch appears at the end of this postmortem)
* Reduced inter-service timeout durations to avoid holding connections unnecessarily
* Improved separation of critical and non-critical log ingestion paths

We are also adjusting our deployment strategy to roll out changes incrementally and observe system behaviour under real traffic before full rollout.

### Resolution and Recovery

The incident was resolved by relieving pressure on the logging service and allowing internal services to recover available connections. Once normal operating conditions returned, all queued and stored logs were processed successfully.

### Closing

We apologise for the disruption this incident caused. While the underlying issue occurred within our internal infrastructure, we recognise that the impact was customer-visible and take responsibility for it. The lessons learned from this incident have already led to concrete improvements that strengthen the resilience of our platform. We remain committed to delivering a reliable, performant service and to being transparent when issues occur.

If you have questions or would like further detail, please contact the SchemeServe support team.
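As a closing technical note, the "dedicated HTTP clients for log writes with short, bounded timeouts" improvement can be pictured roughly as follows. This is an illustrative Python sketch under assumed names and an invented endpoint URL; it is not SchemeServe's actual implementation.

```python
import requests

# Illustrative sketch only; the endpoint and names below are hypothetical.
LOGGING_ENDPOINT = "https://logging.internal.example/ingest"

# Pattern that contributed to the incident: a new client per log write
# multiplies concurrent connections and can exhaust available sockets.
def send_log_per_call(payload):
    with requests.Session() as session:                 # new client every call
        session.post(LOGGING_ENDPOINT, json=payload)    # no timeout: can hang

# Remediation sketch: one long-lived, dedicated session reused for all log
# writes, with a short, bounded timeout so a slow logging pipeline cannot
# hold application threads or connections open.
_LOG_SESSION = requests.Session()

def send_log(payload):
    try:
        _LOG_SESSION.post(LOGGING_ENDPOINT, json=payload, timeout=(0.5, 2.0))
    except requests.RequestException:
        # Logging is non-critical: fail fast rather than block the
        # customer-facing request that triggered this log write.
        pass
```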