Deployments recovered
Timeline · 1 update
- resolved Jun 03, 2026, 07:35 PM UTC
Trigger.dev had 18 outages in the last 2 years totaling 97h 43m of downtime, averaging 0.7 incidents per month.
There have been 18 Trigger.dev outages since December 1, 2025, totaling 97h 43m of downtime. Each is summarised below with incident details, duration, and resolution information.
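As a rough check on the summary figures, the sketch below derives the mean downtime per incident and the monthly incident rate. It assumes "the last 2 years" means a 24-month window; the variable and function names are illustrative only.

```typescript
// Figures taken from the summary above.
const incidents = 18;
const downtimeMinutes = 97 * 60 + 43; // 97h 43m = 5863 minutes
const windowMonths = 24; // assumption: "last 2 years" = 24 months

// Mean downtime per incident, in minutes.
const meanMinutes = downtimeMinutes / incidents;

// Format a minute count as "Xh Ym", rounding to the nearest minute.
function formatDuration(totalMinutes: number): string {
  const rounded = Math.round(totalMinutes);
  return `${Math.floor(rounded / 60)}h ${rounded % 60}m`;
}

// Incidents per month over the assumed window.
const incidentsPerMonth = incidents / windowMonths;

console.log(formatDuration(meanMinutes)); // prints "5h 26m"
console.log(incidentsPerMonth.toFixed(2)); // prints "0.75"
```

Under that assumption the rate works out to 0.75 per month, which is consistent with the reported 0.7 if the page truncates rather than rounds.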
Realtime recovered
Realtime metadata updates and streaming v1 are not live; they have fallen behind. We're working to remediate this.
Realtime is live again. We're really sorry for this extended period of large delays. The service couldn't keep up with the number of runs being processed and was falling further behind. We have made some configuration changes and upgraded it so it can cope with a higher throughput of runs. If you were using our React hooks only for streaming, you were not impacted by this.
In user runs, we're seeing an increase in DNS-related errors such as "Error: getaddrinfo ENOTFOUND" and "Error: getaddrinfo EAI_AGAIN". We're investigating why this is happening.
DNS service is now back to fully operational. Increased traffic combined with a routine infrastructure rollout caused intermittent DNS resolution failures. We've tuned our DNS configuration to resolve the issue and are working on longer-term improvements to prevent recurrence.
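For task code that hit these errors, one possible client-side mitigation is to retry only on the two error codes quoted above. The sketch below is a generic Node.js pattern, not part of the Trigger.dev SDK: `isTransientDnsError`, `backoffDelayMs`, and `withDnsRetry` are hypothetical helpers, and it assumes the error (or its `cause`, as some HTTP clients wrap it) carries a Node-style `code` property.

```typescript
// Error codes quoted in the incident report above.
const TRANSIENT_DNS_CODES = new Set(["ENOTFOUND", "EAI_AGAIN"]);

// Read a string `code` property from an unknown value, if present.
function errorCode(value: unknown): string | undefined {
  if (typeof value !== "object" || value === null) return undefined;
  const code = (value as { code?: unknown }).code;
  return typeof code === "string" ? code : undefined;
}

// True when the error, or its direct `cause`, has a transient DNS code.
function isTransientDnsError(err: unknown): boolean {
  const cause =
    typeof err === "object" && err !== null
      ? (err as { cause?: unknown }).cause
      : undefined;
  const code = errorCode(err) ?? errorCode(cause);
  return code !== undefined && TRANSIENT_DNS_CODES.has(code);
}

// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical helper: retry `fn` on transient DNS errors only,
// giving up after `maxAttempts` total attempts.
async function withDnsRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransientDnsError(err) || attempt >= maxAttempts - 1) throw err;
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A call might look like `await withDnsRetry(() => fetch(url))`. Errors that are not DNS-related are rethrown immediately so that genuine failures are not masked.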
The runs list and detail pages in the dashboard are currently degraded due to an ongoing issue with our ClickHouse DB. We're also observing some log and span ingestion failures. We're currently investigating. Run executions are not impacted.
The issue has been resolved. Dashboard and telemetry are now fully operational.
Dequeues are slower than normal in us-east-1. Runs are still executing, but they are slower to start. We’re investigating the issue.
The issue is now resolved and dequeue times are back to normal. Mainly free-tier runs were affected. This was caused by a spike in the free-tier run volume.
We are experiencing intermittent issues that may cause some task runs to fail. Automatic retries are in place and should recover most affected runs. Our team is actively working on resolution.
Full service has been restored. Task execution is back to normal. If you experienced failures between 01:37 and 04:19 UTC, those runs can be retried successfully now.
What happened: During a period of high activity, a backlog of completed runs built up faster than our cleanup processes could handle, which put pressure on internal services and caused intermittent failures.
What we did: We spun up additional cleanup capacity to clear the backlog and restore normal operation.
What we're doing next: We're increasing resource limits on critical internal services and adding better alerting so we can catch this earlier if it happens again.
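To find which of your own failed runs fall inside the 01:37 to 04:19 UTC window, a filter along these lines could be used. It is a minimal sketch: `FailedRun` is a hypothetical record shape rather than a Trigger.dev SDK type, and the incident date is passed in as a parameter because the update above does not state it.

```typescript
// Hypothetical shape for a failed run record; not a Trigger.dev SDK type.
interface FailedRun {
  id: string;
  failedAt: Date;
}

// Window from the incident update: 01:37 to 04:19 UTC, inclusive.
const WINDOW_START = 1 * 60 + 37;
const WINDOW_END = 4 * 60 + 19;

// Minutes since midnight UTC for a timestamp.
function utcMinutes(d: Date): number {
  return d.getUTCHours() * 60 + d.getUTCMinutes();
}

// Select runs that failed on `incidentDay` (YYYY-MM-DD, UTC) inside the window.
function runsToRetry(runs: FailedRun[], incidentDay: string): FailedRun[] {
  return runs.filter((run) => {
    const sameDay = run.failedAt.toISOString().slice(0, 10) === incidentDay;
    const minutes = utcMinutes(run.failedAt);
    return sameDay && minutes >= WINDOW_START && minutes <= WINDOW_END;
  });
}
```

Comparing in UTC avoids off-by-hours mistakes when the machine running the filter is in another time zone.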
We had a brief outage earlier which affected a subset of schedules. We are working on a fix to get them going again.
All schedules have been fully restored.
Our task log storage system is currently overloaded and we are working on bringing up additional capacity, but in the meantime some logs may be lost.
We have finally been able to provision additional capacity and logs are working again. A full post-mortem will follow.
Our process for syncing runs to ClickHouse is currently delayed. The runs list in the dashboard will be behind, but runs are executing as normal.
The runs list is now up to date and syncing live updates again.
There is a backlog in processing batchTrigger and batchTriggerAndWait calls. This means runs created through these calls are being created more slowly than normal. We're investigating why this is happening.
The new batch concurrency processing defaults have brought the processing queue down to zero.
Runs are not syncing to our ClickHouse instances fast enough, so data in the dashboard's runs list is delayed. Runs are operating normally.
Runs are now syncing live and the dashboard is back to normal.
Writes and reads to Realtime streams v2 are currently suffering an outage and we're investigating.
A fix has been applied and Realtime streams v2 is fully operational.
We’re seeing a percentage of queries failing from ClickHouse Cloud, which powers some pages in the dashboard, such as the Tasks graphs, the Runs page, and the logs. We’re talking to their team to try to resolve this.
Operations have returned to normal. We're continuing to investigate the root cause and will provide more detail as we learn more.
The dashboard is currently degraded due to an ongoing issue with our ClickHouse DB. We're currently investigating further. Run executions are not impacted.
The issue in ClickHouse is now resolved. The dashboard is back to being fully operational. The root cause was a faulty node in the ClickHouse cluster which we couldn't kill. We're speaking to the ClickHouse Cloud team to find out why it happened.
The runs list is currently showing stale data. Runs are executing as normal. Our replication process from PostgreSQL to our ClickHouse instance is falling behind, so the dashboard is showing stale run data. We're investigating.
The runs list has caught up and the dashboard is no longer displaying stale data. We're continuing to investigate the root cause of this issue.
We are currently having issues with our ingestion of OpenTelemetry logs and spans after rolling out a fix for the ClickHouse issue that occurred over the weekend. We're investigating.
We've published a full post-mortem on this incident here: https://trigger.dev/blog/clickhouse-too-many-parts-postmortem