Netdata incident

Startup issue in latest Agent nightly (1.40.0-6-nightly)

Minor · Resolved

Netdata experienced a minor incident on June 16, 2023 affecting Agent - Cloud Connection (ACLK) and Agent (all platforms), lasting 12h 3m. The incident has been resolved; the full update timeline is below.

Started
Jun 16, 2023, 05:55 AM UTC
Resolved
Jun 16, 2023, 05:58 PM UTC
Duration
12h 3m
Detected by Pingoru
Jun 16, 2023, 05:55 AM UTC

Affected components

Agent - Cloud Connection (ACLK), Agent (all platforms)

Update timeline

  1. investigating Jun 16, 2023, 05:55 AM UTC

    We are currently investigating an issue with agent connectivity to the cloud.

  2. identified Jun 16, 2023, 06:53 AM UTC

    Agents running the most recent nightly (1.40.0-6-nightly) fail to start on some platforms because of a permissions issue. We believe the culprit is this change: https://github.com/netdata/netdata/pull/14890, and we are working on a fix. Because this happens early in Agent startup, it affects Cloud and non-Cloud users alike.

  3. identified Jun 16, 2023, 08:11 AM UTC

    While we are working on a fix, which requires a new package to be built, we have developed a workaround. It requires downgrading the Agent to 1.40.0-2-nightly and fixing the permissions. For Debian-based systems, this script, run as root, should work: https://gist.github.com/ralphm/1326498c474aaacf0a12f9e569dac863

  4. identified Jun 16, 2023, 11:43 AM UTC

    We have created a fix for this issue. It combines two changes: systemd no longer changes the ownership and permissions of the directories the Agent uses, and the Agent now changes permissions recursively to recover from the effects of the bad version. As soon as we've tested the fix and the packages have been built, we will trigger an explicit push to the nightlies repos.

  5. identified Jun 16, 2023, 01:24 PM UTC

    The fix has been merged, and we've kicked off the build process for the packages. We will provide another update when the packages for the affected systems have been pushed.

  6. monitoring Jun 16, 2023, 02:42 PM UTC

    The native packages for x86-based distributions have been published. The ARM packages and the static builds are still building and should follow shortly. We're watching Netdata Cloud and the various social networking tools to track the outcome of the new builds.

  7. monitoring Jun 16, 2023, 02:57 PM UTC

    The source tarballs with the fix for native builds are now available. Packages for ARM systems are still building but should be fully published and available by 17:00 UTC at the latest.

  8. resolved Jun 16, 2023, 05:58 PM UTC

    All packages have been published. If your nodes are still on 1.40.0-6, please refer to the instructions to upgrade: https://learn.netdata.cloud/docs/maintaining/update-netdata-agents#updates-for-most-systems. We are now closing this incident, but please let us know if things are still not working on your nodes.
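
The recursive ownership repair mentioned in update 4 can be illustrated with a minimal Python sketch. This is not the Agent's actual code or the linked workaround script; the function names `iter_tree` and `restore_ownership` are hypothetical, and the `netdata` user name and `/var/lib/netdata` path in the usage comment are assumptions that may not match every installation.

```python
import os


def iter_tree(root):
    """Yield root itself, then every directory and file below it."""
    yield root
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def restore_ownership(root, uid, gid):
    """Set owner and group on root and everything below it.

    Returns the number of entries whose ownership had to be changed.
    """
    changed = 0
    for path in iter_tree(root):
        st = os.lstat(path)
        if st.st_uid != uid or st.st_gid != gid:
            # follow_symlinks=False changes the link itself, not its target.
            os.chown(path, uid, gid, follow_symlinks=False)
            changed += 1
    return changed


# Hypothetical usage (run as root; the user name and path are assumptions):
#   import pwd
#   account = pwd.getpwnam("netdata")
#   restore_ownership("/var/lib/netdata", account.pw_uid, account.pw_gid)
```

Entries that already have the expected owner are skipped, so the function can be run repeatedly without side effects.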