Update timeline
- Resolved Mar 17, 2026, 07:30 PM UTC
Type: Incident
Duration: 5 hours and 16 minutes

**Mar 17, 19:30:32 GMT+0 - Resolved**

All systems have fully recovered. The ingestion backlog has been cleared and incoming data is being processed in real time again. Dashboards and API responses are up to date. We’ll publish a detailed postmortem shortly with more information on what happened and what we’re doing to prevent this in the future.

**Mar 17, 19:39:02 GMT+0 - Postmortem**

### Summary

During a routine deployment, we introduced a database migration that degraded the performance of our primary database more than expected. This slowed down our ingestion pipeline and caused a growing backlog of incoming measurements.

As the backlog grew, it exposed a weakness in how our ingestion workers handle concurrency. Instead of recovering once database performance stabilized, the system got stuck in a state where workers were effectively slowing each other down. That meant the backlog couldn’t clear itself, and data delays persisted longer than they should have.

No measurement data was lost at any point during the incident. However, dashboards and API responses showed outdated values until the system fully recovered.

---

### Root Cause

Two factors were at play here. The initial trigger was the database migration, which temporarily made queries slower and reduced ingestion throughput. That part alone would have been manageable.

The bigger issue was how our ingestion workers behaved under pressure. They were configured to run many tasks concurrently on shared database and cache connections. Under normal conditions this works well and is efficient. But once the backlog built up, tasks started competing for those shared resources. Instead of working through the queue faster, they got in each other’s way, turning a temporary slowdown into a sustained bottleneck.

---

### Resolution

We first addressed the database performance issue to remove the initial trigger.
After that, we changed how ingestion workers run. Instead of using a shared, thread-based model, we moved to process-based workers, each with its own dedicated connections. This removed the contention entirely, and the system caught up within minutes.

---

### Follow-up Actions

- We’re adding better visibility into this kind of situation, especially around queue depth and processing times, so we can react earlier if something similar happens again.
- We’re reviewing other parts of the system for similar concurrency patterns that could behave poorly under load.
- We’ll introduce an additional staging step for database migrations to catch performance regressions before they reach production.

---

If anything was unclear during the incident or you have questions, feel free to reach out.

**Mar 17, 14:14:05 GMT+0 - Investigating**

We are currently investigating issues with the availability of our API.

**Mar 17, 14:22:06 GMT+0 - Identified**

We have traced the issue to an upstream database and are continuing to work on a fix for this incident.

**Mar 17, 14:39:51 GMT+0 - Monitoring**

We identified a number of stuck database queries that were impacting overall performance. After clearing them, things are back to normal. There is still a backlog of data being processed, so you might notice slight delays in data ingestion until the queue has fully caught up. We’re keeping a close eye on it and will share further updates if needed.

**Mar 17, 17:28:43 GMT+0 - Monitoring**

There is still a backlog of data being processed, currently causing ingestion delays of around 30–40 minutes. We’re actively working on improving queue performance to reduce this delay and will continue to keep you updated as things progress.
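The contention pattern described in the postmortem, and the fix, can be sketched in miniature. This is a hypothetical Python illustration under our own assumptions, not Datacake's actual ingestion code: all names (`SharedConnection`, `threaded_ingest`, `process_ingest`) are invented. Thread-based workers funneling work through one shared connection serialize on its lock, so adding threads adds contention rather than throughput; process-based workers each construct their own connection, so they cannot block one another.

```python
import threading
import multiprocessing as mp


class SharedConnection:
    """Stand-in for a DB connection that only one task may use at a time."""

    def __init__(self):
        self._lock = threading.Lock()

    def execute(self, item):
        with self._lock:      # every caller serializes here
            return item * 2   # pretend this is a query


def threaded_ingest(items, n_workers=4):
    """Thread-based model: all workers share ONE connection (the bad pattern)."""
    conn = SharedConnection()
    results, results_lock = [], threading.Lock()

    def worker(chunk):
        for item in chunk:
            out = conn.execute(item)  # contends with every other thread
            with results_lock:
                results.append(out)

    chunks = [items[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


def _process_worker(chunk):
    # Each process builds its own connection: no cross-worker contention.
    conn = SharedConnection()
    return [conn.execute(item) for item in chunk]


def process_ingest(items, n_workers=4):
    """Process-based model: one dedicated connection per worker (the fix)."""
    chunks = [items[i::n_workers] for i in range(n_workers)]
    with mp.Pool(n_workers) as pool:
        nested = pool.map(_process_worker, chunks)
    return sorted(x for chunk in nested for x in chunk)
```

Both functions produce the same results; the difference only shows up under load, where the shared-connection version degrades as workers queue on the lock while the per-process version scales with the number of workers.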