
docs: reword nodes-ephemerality for clarity (#19604)

Ilya Mashchenko 1 month ago
parent
commit
80a0e201da
1 changed file with 57 additions and 56 deletions

+ 57 - 56
docs/nodes-ephemerality.md

@@ -1,89 +1,90 @@
-
 # Nodes Ephemerality in Netdata
 
 ## Overview
 
-In distributed monitoring environments, maintaining a reliable and consistent observability system is crucial. Netdata v2.23 introduces significant improvements to how ephemeral nodes are managed, ensuring a better balance between alerting consistency and flexibility for transient infrastructure.
-
-Previously, ephemeral nodes were defined as "nodes that are forgotten a day after they last disconnect." This approach sometimes led to unexpected inconsistencies in monitoring, particularly for users operating highly dynamic environments. With v2.23, ephemeral nodes are now defined as "nodes that are expected to disconnect without alerts being raised."
-
-This change serves three key objectives:
-
-1.  **Stronger Monitoring Consistency for Permanent Nodes**: By ensuring that only permanent nodes trigger disconnection alerts, users can focus on real operational issues without being overwhelmed by alert noise.
-
-2.  **Enhanced Flexibility for Transient Environments**: Users managing auto-scaling cloud instances, containers, and other volatile infrastructure can now configure nodes as ephemeral, preventing unnecessary alerts and making monitoring more effective.
-
-3.  **Automated Cleanup for Ephemeral Nodes**: Netdata provides an automated way for the monitoring system to clean up itself by "forgetting" ephemeral nodes after a defined period. By default, the retention period is determined by the parent nodes' data retention settings. However, given that Netdata's tiered storage may provide retention for months or years, users may configure a shorter expiration time for ephemeral nodes.
+Netdata v2.3.0 changes how ephemeral nodes are defined and managed in distributed monitoring environments. This update improves monitoring reliability while keeping dynamic infrastructure flexible to manage.
 
-By introducing these changes, Netdata significantly enhances its ability to monitor itself as a mesh-like distributed observability system, ensuring that alerts reflect actual system health rather than expected, routine disconnections. Additionally, the automatic cleanup feature prevents stale ephemeral nodes from accumulating in the monitoring system, keeping dashboards clean and up-to-date.
+**Key Changes**:
 
-## Understanding Ephemeral Nodes
+Netdata now defines ephemeral nodes as "nodes that are expected to disconnect without raising alerts," replacing the previous definition of nodes that are forgotten after one day of disconnection. This change provides three major benefits:
 
-When it comes to ephemerality, Netdata supports 2 types of nodes:
+1. **Improved Permanent Node Monitoring**: Disconnection alerts are now triggered only for permanent nodes, reducing alert noise and helping teams focus on genuine operational issues.
+2. **Better Support for Dynamic Infrastructure**: Organizations using auto-scaling cloud instances, containers, and other dynamic resources can now designate nodes as ephemeral, preventing unnecessary alerts.
+3. **Automated Node Management**: The system automatically removes ephemeral nodes based on configurable retention periods, maintaining clean and relevant monitoring dashboards.
 
--   **Ephemeral Nodes**: nodes that are expected to disconnect and/or reconnect frequently, or nodes that are expected to shut down or vanish at any point in time. Such nodes may be:
+## Node Types
 
-    -   Auto-scaling cloud instances.
-    -   Containers and VMs that are created and destroyed dynamically.
-    -   IoT devices with intermittent connectivity.
-    -   Development/test environments where nodes frequently restart.
--   **Permanent Nodes**: nodes that are expected to always be online, and disconnections are a strong indication of some kind of failure that operations teams should be aware of.
+Netdata supports two types of nodes:
 
+| Type      | Description                                          | Common Examples                                                                                                                                                             |
+|-----------|------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Ephemeral | Nodes expected to disconnect or reconnect frequently | • Auto-scaling cloud instances<br/>• Dynamic containers and VMs<br/>• IoT devices with intermittent connectivity<br/>• Development/test environments with frequent restarts |
+| Permanent | Nodes expected to maintain continuous connectivity   | • Production servers<br/>• Core infrastructure nodes<br/>• Critical monitoring systems<br/>• Stable database servers                                                        |
 
-## Configuring Ephemeral Nodes
+> **Note**: Disconnections in permanent nodes indicate potential system failures requiring immediate attention.
 
-By default, all nodes in Netdata are **permanent**. Users can mark nodes as ephemeral like this:
+## Setting Up Ephemeral Nodes
 
-At the `netdata.conf` of the ephemeral node, set:
+By default, Netdata treats all nodes as permanent. To mark a node as ephemeral:
 
-```ini
-[global]
-   is ephemeral node = yes
+1. Open `netdata.conf` on the target node
+2. Add the following configuration:
+   ```ini
+   [global]
+     is ephemeral node = yes
+   ```
+3. Restart the node
 
-```
-
-And restart the node. This ephemerality flag is propagated to Netdata Parents and Netdata Cloud via the `_is_ephemeral` host label (boolean: true/false).
+This configuration sets the `_is_ephemeral` host label which propagates to Netdata Parents and Netdata Cloud.
 
-## Netdata Parents Alerts
+## Parent Node Alerts
 
-Netdata v2.23 introduces two alerts for **permanent** nodes:
+Netdata v2.3.0 adds [two alerts](https://github.com/netdata/netdata/blob/master/src/health/health.d/streaming.conf) specifically for permanent nodes:
 
--   `streaming_never_connected`: Counts the number of **permanent** nodes never connected to a Netdata Parent (since its last restart) and transitions to WARNING when this number is non-zero.
--   `streaming_disconnected`: Counts the number of **permanent** nodes that have been connected but are now disconnected from the Netdata Parent, and transitions to WARNING when this number is non-zero.
+| Alert                     | Triggers                                                      |
+|---------------------------|---------------------------------------------------------------|
+| streaming_never_connected | When permanent nodes have never connected to a Netdata Parent |
+| streaming_disconnected    | When previously connected permanent nodes disconnect          |
 
-To identify the exact nodes that trigger these alarms, use the `Netdata-streaming` function under the `Top` tab of the dashboard. This Netdata Function presents a list (a table) of all nodes known to a Netdata Parent and provides detailed state information for the lifecycle of the node, including database status, ingestion status, streaming status, health and alerts status, and more.
+## Monitoring Child Node Status
 
-In this table, red lines indicate a problem during ingestion, and yellow lines indicate a problem during (re)streaming. Colored lines are only related to **permanent** nodes. Filter the table using `Ephemerality` by selecting `permanent` and use the table columns `InStatus`, `InReason`, and `InAge` to understand the ingestion issue at hand. Similarly, for (re)streaming to another Netdata Parent, use `OutStatus`, `OutReason`, and `OutAge`.
+To investigate an alert:
 
-### How to Mark Archived Nodes as Ephemeral
+1. Navigate to the `Top` tab in your dashboard
+2. Select the `Netdata-streaming` function
+3. Review the detailed node status table:
+    - Red lines: Node connection problems (when nodes attempt to connect to this Parent)
+    - Yellow lines: Restreaming issues (when this Parent attempts to stream data to other Parent nodes)
+    - Color highlighting applies only to permanent nodes
+    - Filter by `Ephemerality` to focus on permanent nodes
+    - Use `InStatus`, `InReason`, and `InAge` columns to analyze nodes connecting to this Parent
+    - Use `OutStatus`, `OutReason`, and `OutAge` columns to analyze this Parent's restreaming to other Parent nodes
 
-In case there are **permanent** nodes that are no longer available, in order to clear the alerts, the following command must be run on each of the Netdata Parents having these alerts raised:
+## Managing Archived Nodes
 
-```sh
-netdatacli mark-stale-nodes-ephemeral ALL_NODES
+To clear alerts for permanently offline nodes:
 
+```bash
+netdatacli mark-stale-nodes-ephemeral <node_id | machine_guid | hostname | ALL_NODES>
 ```
 
-This command instructs Netdata to mark as **ephemeral** all the nodes not currently online.
-
-Keep in mind that nodes will be marked again as **permanent** if they reconnect and they have not been configured in their `netdata.conf` to be **ephemeral**. So, marking them at the parents is only useful for nodes that are not expected to connect again.
+> **Note**: Nodes will revert to permanent status if they reconnect unless configured as ephemeral in their `netdata.conf`.
 
-## Netdata Cloud Alerts
+## Cloud Integration
 
-Before Netdata v2.23, Netdata Cloud was sending node unreachable notifications for all nodes, independently of their ephemerality.
+Starting with v2.3.0, Netdata Cloud sends node-unreachable notifications **exclusively for permanent nodes**, improving alert relevance.
 
-Since Netdata v2.23, Netdata Cloud is sending node unreachable notifications only for **permanent** nodes.
+## Automatic Ephemeral Nodes Cleanup
 
-## Automatically "Forgetting" Ephemeral Nodes
+The automatic removal of disconnected ephemeral nodes is disabled by default in v2.3.0+. To enable this feature:
 
-Netdata versions prior to v2.23 were automatically "forgetting" ephemeral nodes if they disconnected for more than 1 day. In Netdata v2.23+, this feature is now **disabled** by default.
+1. Edit the `netdata.conf` file on Netdata Parent nodes
+2. Add the following configuration:
 
-To enable it again, set this in `netdata.conf` of the Netdata Parents that are expected to "forget" the ephemeral nodes:
-
-```ini
-[db]
-   cleanup ephemeral hosts after = 1d
-
-```
+   ```ini
+   [db]
+     cleanup ephemeral hosts after = 1d
+   ```
+3. Restart the Netdata Parent
 
-The above instructs the Netdata Parent to automatically "forget" ephemeral nodes 1 day after they disconnect. When a node is "forgotten," its data is no longer available for queries, and when all parents reporting the node to Netdata Cloud "forget" it, Netdata Cloud automatically deletes the node.
+This setting removes ephemeral nodes from queries 24 hours after disconnection. When all parent nodes remove a node, Netdata Cloud automatically deletes it too.
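
One way to confirm that a `netdata.conf` contains the two settings touched by this change is the minimal Python sketch below. It is not Netdata's own parser: `read_setting` is a hypothetical helper that understands only `[section]` headers, `key = value` lines, and `#` comments, and it returns the first match it finds.

```python
def read_setting(conf_text, section, key):
    """Return the value of `key` inside `[section]`, or None if absent.

    Simplified, assumed parsing rules (not Netdata's implementation):
    blank lines and lines starting with '#' are skipped, and the first
    matching key in the section wins.
    """
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            continue
        if current == section and "=" in line:
            name, value = line.split("=", 1)
            if name.strip() == key:
                return value.strip()
    return None


# Sample text mirroring the two snippets from the documentation.
sample = """
[global]
  is ephemeral node = yes

[db]
  cleanup ephemeral hosts after = 1d
"""

print(read_setting(sample, "global", "is ephemeral node"))
print(read_setting(sample, "db", "cleanup ephemeral hosts after"))
```

In practice the text would come from the node's actual `netdata.conf`; a return value of `None` means the key was not found in that section, so the documented default applies (permanent node, cleanup disabled).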