Honeycomb.io incident

Degraded query performance

Honeycomb.io experienced a major incident on April 16, 2025, lasting 52m. The incident has been resolved; the full update timeline is below.

Started: Apr 16, 2025, 05:35 PM UTC
Resolved: Apr 16, 2025, 06:27 PM UTC
Duration: 52m
Detected by Pingoru: Apr 16, 2025, 05:35 PM UTC

Update timeline

identified Apr 16, 2025, 05:35 PM UTC

We have identified resource contention that currently leads to degraded query performance, which has slowed down most querying types for the last hour. The situation seems to be improving but we are keeping an eye on it.

identified Apr 16, 2025, 05:55 PM UTC

We are continuing to work on a fix for this issue.

monitoring Apr 16, 2025, 05:56 PM UTC

Performance is now back to normal. We have added Triggers and SLOs to the list of impacted services, and upgraded the impact to Major given some triggers did not run.

resolved Apr 16, 2025, 06:27 PM UTC

The system is stable and performance should be back to normal.

postmortem Apr 22, 2025, 07:39 PM UTC

On April 16, we experienced 55 minutes of degraded query performance in interactive queries and board rendering for a dozen or so teams. During this time, queries that were usually fast took much longer than usual, going from less than 5 seconds to about a minute. More importantly, for about 25 minutes, the evaluation of triggers and SLOs in our US region was interrupted, meaning alerts may have been delayed or missed.

We detected the slow queries mostly through customers reaching out to us. On our end, the main performance SLOs never fell below their thresholds and we were within our budget overall. We attributed the rising delays to increased usage of shared lambda resources, caused by background tasks being queued up, which in turn created contention for some queries.

As we started an internal incident to handle this, we were paged about our alerting subsystem not reporting as healthy. We saw the contention in the underlying resources as the main contributor and tweaked some rate limiting parameters to bring overall usage back to manageable levels. As we did so, the alerting system also recovered. We monitored the system and made sure it was functioning normally for a while before closing the incident.

Our investigation mostly focused on what exactly caused alerting to hang, a behavior that surprised every responder. A key observation was that the system worked fine under pressure until an automated deployment happened. We eventually found that while resource contention in our lambdas did lead to slowness for queries, it was restarting from the deployment while under pressure that caused the stall. As it turns out, that application does gradual backfilling of recently changed SLOs in the background. However, on its first iteration, it performs this task at boot time in the foreground and _then_ moves it to the background. Because the application restarted while the system was under heavy contention, it stalled on that first run and did not recover while load remained high. When we solved the contention issue, the background jobs managed to finish, then became asynchronous, and alerting came back.

Our two follow-up actions have been to tweak the alerting for our triggers and SLO components so they page roughly 3-5x faster next time, and to make sure the first evaluation of background tasks is done asynchronously, as we initially expected it to be. We do not plan on doing further in-depth reviews of this incident at this time.
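
The difference between the two startup behaviors described above can be illustrated with a minimal sketch. This is not Honeycomb's code: the `Backfiller` class, its method names, and the `backfill_once` callable are all hypothetical, and the sketch only assumes a worker that runs a backfill pass on a fixed interval. With `start_blocking_first`, startup cannot complete until the first pass returns, so a slow pass under contention holds up everything behind it. With `start_async`, the first pass runs in the background thread like every later one.

```python
import threading


class Backfiller:
    """Hypothetical periodic worker; names here are illustrative only."""

    def __init__(self, backfill_once, interval_s=60.0):
        self._backfill_once = backfill_once  # hypothetical callable: one backfill pass
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = None

    def start_blocking_first(self):
        # First pass runs in the foreground: the caller (application startup)
        # is blocked until it returns, however long that takes.
        self._backfill_once()
        self._spawn(run_immediately=False)

    def start_async(self):
        # First pass runs in the background thread, so startup returns at once.
        self._spawn(run_immediately=True)

    def _spawn(self, run_immediately):
        self._thread = threading.Thread(
            target=self._loop, args=(run_immediately,), daemon=True
        )
        self._thread.start()

    def _loop(self, run_immediately):
        if run_immediately:
            self._backfill_once()
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._backfill_once()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=5)
```

In this sketch, an application that calls `start_async()` can go on to serve its other duties (such as evaluating alerts) even if the first backfill pass is slow, whereas `start_blocking_first()` ties startup time to the speed of that pass.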