Netdata incident

Slow and failing Agent chart data responses

Critical · Resolved

Netdata experienced a critical incident on June 29, 2022 affecting Agent - Cloud Connection (ACLK), lasting 1d 8h. The incident has been resolved; the full update timeline is below.

Started: Jun 29, 2022, 08:54 AM UTC
Resolved: Jun 30, 2022, 05:35 PM UTC
Duration: 1d 8h
Detected by Pingoru: Jun 29, 2022, 08:54 AM UTC

Affected components

Agent - Cloud Connection (ACLK)

Update timeline

  1. investigating Jun 29, 2022, 08:54 AM UTC

    Users with nightly versions of the Netdata Agent are experiencing slow responses between Cloud and Agent, resulting in failing or slow charts in their Cloud dashboards. We are investigating the issue.

  2. identified Jun 29, 2022, 02:11 PM UTC

    We have identified part of the cause of failing responses for alarm values. In yesterday's nightly build of the Agent, we enabled the use of the newer MQTT5 library by default. We will create another build to revert that. In the meantime, you can explicitly disable this library using the mqtt5 setting in your configuration as described here: https://github.com/netdata/cloud-backend/issues/178. Additionally, the remaining latencies appear to be another instance of a known issue that causes responses with a small payload to be delayed. We are working on resolving this issue.

  3. monitoring Jun 30, 2022, 07:20 AM UTC

    The new nightly version of the Netdata Agent has been published and installed by a large portion of the agents that auto-update. We are monitoring the results.

  4. monitoring Jun 30, 2022, 08:03 AM UTC

    For completeness, the affected versions are v1.35.0-84-nightly and v1.35.0-96-nightly. The latest, corrected version is v1.35.0-104-nightly.

  5. resolved Jun 30, 2022, 05:35 PM UTC

    Reverting the default away from MQTT5 removed the immediate issue, and most Agents on the nightlies are now on the latest version (v1.35.0-104-nightly). In the meantime, we have also found the true cause: the Agent was not properly processing incoming commands in the MQTT5 implementation, due to a bug in how the parser interacted with the buffer of incoming data. This has been resolved in the upcoming nightly build of the Agent. As we want to do some more testing, for now the Agent will keep using the older MQTT library by default.
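
The final update attributes the root cause to how the MQTT5 parser interacted with the buffer of incoming data, but gives no further detail. As a purely illustrative sketch of that class of problem (this is not Netdata's code; the Agent is written in C, and the names `PacketBuffer` and `feed` are hypothetical), the Python below shows one common way to parse MQTT packets from a receive buffer: a packet is consumed only once it has arrived in full, and any partial data is kept for the next read. The only protocol detail assumed is the standard MQTT fixed header, one type/flags byte followed by a variable byte integer "remaining length".

```python
from typing import List, Optional, Tuple


def decode_remaining_length(buf: bytes, start: int) -> Optional[Tuple[int, int]]:
    """Decode an MQTT variable byte integer starting at buf[start].

    Returns (value, bytes_consumed), or None if the field is still incomplete.
    Raises ValueError if the encoding runs past the four-byte maximum.
    """
    value = 0
    multiplier = 1
    for i in range(4):
        pos = start + i
        if pos >= len(buf):
            return None  # the length field itself has not fully arrived
        byte = buf[pos]
        value += (byte & 0x7F) * multiplier
        if (byte & 0x80) == 0:  # continuation bit clear: last length byte
            return value, i + 1
        multiplier *= 128
    raise ValueError("malformed remaining length")


class PacketBuffer:
    """Hypothetical receive buffer that yields only complete MQTT packets."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, data: bytes) -> List[Tuple[int, bytes]]:
        """Append newly received bytes and return all complete packets.

        Each packet is returned as (packet_type, body). Incomplete trailing
        data stays in the buffer until a later call supplies the rest.
        """
        self._buf.extend(data)
        packets: List[Tuple[int, bytes]] = []
        while len(self._buf) >= 2:
            decoded = decode_remaining_length(self._buf, 1)
            if decoded is None:
                break  # wait for the rest of the length field
            remaining, length_bytes = decoded
            total = 1 + length_bytes + remaining
            if len(self._buf) < total:
                break  # wait for the rest of the packet body
            packet_type = self._buf[0] >> 4
            body = bytes(self._buf[1 + length_bytes:total])
            packets.append((packet_type, body))
            del self._buf[:total]  # consume exactly one packet
        return packets
```

Feeding the buffer a packet in several small pieces, or several packets in one read, should produce the same sequence of packets. A parser that assumes each read contains exactly one whole packet can stall or drop commands, which is one plausible way for a bug of the kind described above to appear.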