Netdata incident

Recent nightly static and local builds of Netdata Agent overwrite netdata.conf with defaults

Minor Resolved

Netdata experienced a minor incident on May 1, 2024, lasting 1h 37m. The incident has been resolved; the full update timeline is below.

Started
May 01, 2024, 03:44 PM UTC
Resolved
May 01, 2024, 05:22 PM UTC
Duration
1h 37m
Detected by Pingoru
May 01, 2024, 03:44 PM UTC

Update timeline

  1. identified May 01, 2024, 03:44 PM UTC

    We've identified an issue with static and local builds of the Netdata Agent that causes its main configuration file, `/etc/netdata/netdata.conf` or `/opt/netdata/etc/netdata/netdata.conf`, to be overwritten with the defaults. The `netdata-updater.conf` file is similarly affected. Depending on which settings you had changed from the defaults, this may result in data loss. We will update this incident with more detailed information on the impact as soon as possible. Docker images and native package builds, as well as stable builds, are not affected. We have created a fix (https://github.com/netdata/netdata/pull/17572) and have triggered a new nightly build. We will update this incident again as soon as the new build artifacts are available.

  2. identified May 01, 2024, 03:59 PM UTC

    The affected nightly build is v1.45.0-315-nightly. Local builds are affected starting with commit https://github.com/netdata/netdata/commit/5973417027606bacf044b3ead40a882931ce773f (April 30, 11:45 UTC) up until commit https://github.com/netdata/netdata/commit/0f2a261839d5ffc42f17383b4292673aa93d6a1f (May 1, 15:13 UTC).

  3. identified May 01, 2024, 04:28 PM UTC

    Update regarding potential data loss: data will be lost if the configuration had been changed to increase metric retention beyond the defaults. Unfortunately, any stored data beyond the default metric retention will be lost on running installs of the affected builds. The only way to prevent this is to not use (or not have used) version v1.45.0-315-nightly. We have made sure that the corresponding artifacts are no longer accessible to the installer.

  4. resolved May 01, 2024, 05:22 PM UTC

    The build artifacts for the new nightly release (1.45.0-326) are now available, and we consider the incident resolved. Should you experience any issues, please let us know!

  5. postmortem May 02, 2024, 10:02 AM UTC

    Prior to [netdata/netdata#17475](https://github.com/netdata/netdata/pull/17475), the `netdata.conf` and `netdata-updater.conf` files were handled by the installer code outside of the build system. With the shift to using the build system to produce packages, handling for them needed to be moved into the build system. However, insufficient testing was performed to confirm that this would not break other installation types, and the change was not properly made conditional on packages being built.

    As a result, static and local builds with version `v1.45.0-315-nightly` overwrite these configuration files with their default templates, so all local changes to those files are lost. In particular, if the Agent configuration had been changed for longer retention, the overwritten configuration undoes those settings, causing any metrics data **beyond the _default_ retention to be lost** on the first run of this version. We have pulled the affected build artifacts to prevent our installer from using them.

    While [the fix](https://github.com/netdata/netdata/pull/17572) ensures the issue won't occur in future versions, starting with version `v1.45.0-326-nightly`, it is important to note that affected installations **will not automatically recover** their previous configurations. If you were using a non-default `netdata.conf` and/or `netdata-updater.conf` and experienced this bug, you will need to **manually reconfigure** your Netdata install.

    As we aim to carefully develop Netdata for many platforms and hardware architectures, we release nightly builds of the Netdata Agent to catch any issues our changes may have caused, beyond our own internal testing. Unfortunately, we sometimes make mistakes that our testing does not catch, with data loss as an extreme possible outcome. Therefore, we strongly recommend using our **stable releases for production systems**. You can review the [difference between nightly and stable builds](https://learn.netdata.cloud/docs/netdata-agent/installation#nightly-vs-stable-releases), and our recommended [best practices](https://www.netdata.cloud/blog/netdata-best-practices/).

    If you have been affected by this issue and/or have any questions, please let us know.
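
For installs that stay on nightly builds, the timeline above suggests two simple precautions: recognising the affected version string, and keeping a copy of the two configuration files before an update. The following is a minimal sketch of both, not an official Netdata tool. It assumes that version strings look like `v1.45.0-315-nightly` as written in this incident, and that `netdata-updater.conf` sits in the same directory as `netdata.conf`. The function names and the backup layout are hypothetical.

```python
import re
import shutil
import time
from pathlib import Path

# Directories named in this incident. Assumption: netdata-updater.conf
# lives alongside netdata.conf in whichever of these exists.
CONFIG_DIRS = (Path("/etc/netdata"), Path("/opt/netdata/etc/netdata"))
CONFIG_NAMES = ("netdata.conf", "netdata-updater.conf")

# The nightly build this incident names as affected.
AFFECTED_NIGHTLY = (1, 45, 0, 315)

# Assumed format, taken from the strings in this incident.
_NIGHTLY_RE = re.compile(r"v?(\d+)\.(\d+)\.(\d+)-(\d+)-nightly")


def is_affected_nightly(version):
    """Return True only for the exact nightly named as affected."""
    match = _NIGHTLY_RE.search(version)
    if match is None:
        return False
    return tuple(int(part) for part in match.groups()) == AFFECTED_NIGHTLY


def backup_configs(backup_root, config_dirs=CONFIG_DIRS):
    """Copy existing config files into a timestamped folder.

    Returns the list of paths that were written.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    copied = []
    for config_dir in config_dirs:
        for name in CONFIG_NAMES:
            source = config_dir / name
            if not source.is_file():
                continue  # this install layout is not present
            # Mirror the source directory so the two layouts cannot collide.
            relative_dir = config_dir.relative_to(config_dir.anchor)
            target = Path(backup_root) / stamp / relative_dir / name
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)  # copy2 also keeps timestamps
            copied.append(target)
    return copied
```

Calling `backup_configs` with a directory of your choice before each nightly update leaves a dated copy to restore from by hand. Note that the version check only matches the published nightly; local builds in the affected commit range are not detected by it.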