Netdata incident
Startup issue in latest Agent nightly (1.40.0-6-nightly)
Netdata experienced a minor incident on June 16, 2023 affecting Agent - Cloud Connection (ACLK) and Agent (all platforms), lasting 12h 3m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jun 16, 2023, 05:55 AM UTC
We are currently investigating an issue with Agent connectivity to Netdata Cloud.
- identified Jun 16, 2023, 06:53 AM UTC
Agents running the most recent nightly (1.40.0-6-nightly) fail to start on some platforms because of a permissions issue. We believe the culprit is this change: https://github.com/netdata/netdata/pull/14890, and are working on a fix. Because the failure happens early in Agent startup, it affects Cloud and non-Cloud users alike.
- identified Jun 16, 2023, 08:11 AM UTC
While we are working on a fix, which requires a new package to be built, we have developed a workaround. It requires downgrading the Agent to 1.40.0-2-nightly and fixing the permissions. For Debian-based systems, this script, run as root, should work: https://gist.github.com/ralphm/1326498c474aaacf0a12f9e569dac863
- identified Jun 16, 2023, 11:43 AM UTC
We have created a fix for this issue. It combines two changes: systemd no longer changes the ownership and permissions of the directories the Agent uses, and the Agent now changes permissions recursively to recover from the effects of the bad version. As soon as we've tested the fix and the packages have been built, we will trigger an explicit push to the nightlies repos.
- identified Jun 16, 2023, 01:24 PM UTC
The fix has been merged, and we've kicked off the build process for the packages. We will provide another update when the packages for the affected systems have been pushed.
- monitoring Jun 16, 2023, 02:42 PM UTC
The native packages for x86-based distributions have been published. The ARM packages and the static builds are still building and should follow shortly. We're watching Netdata Cloud and the various social networking tools to track the outcome of the new builds.
- monitoring Jun 16, 2023, 02:57 PM UTC
The source tarballs with the fix for native builds are now available. Packages for ARM systems are still building but should be fully published and available by 17:00 UTC at the latest.
- resolved Jun 16, 2023, 05:58 PM UTC
All packages have been published. If your nodes are still on 1.40.0-6, please refer to the instructions to upgrade: https://learn.netdata.cloud/docs/maintaining/update-netdata-agents#updates-for-most-systems. We are now closing this incident, but please let us know if things are still not working on your nodes.
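
For readers who want to check a node by hand, the recursive recovery step mentioned in the 11:43 AM UTC update can be illustrated with a minimal Python sketch. This is not the Agent's actual fix and not the workaround script linked above. It only covers ownership (not permission bits), and the function names find_mismatched and restore_ownership are hypothetical. Which directories and which account to pass in are assumptions you would need to confirm for your own installation.

```python
import os
from pathlib import Path


def find_mismatched(root, uid, gid):
    """Return every path under root (root included) not owned by uid:gid."""
    root = Path(root)
    candidates = [root]
    # os.walk does not follow symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            candidates.append(Path(dirpath) / name)

    mismatched = []
    for path in candidates:
        info = os.lstat(path)  # lstat: inspect the link itself, not its target
        if info.st_uid != uid or info.st_gid != gid:
            mismatched.append(path)
    return mismatched


def restore_ownership(root, uid, gid, dry_run=True):
    """Report paths with the wrong owner; change them only if dry_run is False.

    Changing ownership to another account normally requires root.
    """
    paths = find_mismatched(root, uid, gid)
    if not dry_run:
        for path in paths:
            os.chown(path, uid, gid, follow_symlinks=False)
    return paths
```

A cautious way to use this is to call restore_ownership with dry_run=True first, review the returned list, and only then rerun it as root with dry_run=False. The numeric uid and gid can be looked up with the standard pwd module, for example pwd.getpwnam("netdata"), assuming the Agent runs under an account with that name on your system.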