This alert is triggered when the number of failed volumes in your Hadoop Distributed File System (HDFS) cluster increases. A volume can fail because of a hardware fault or a misconfiguration, such as duplicate mounts. When a single volume fails on a DataNode, the entire node may go offline, depending on the dfs.datanode.failed.volumes.tolerated
setting for your cluster. If the node goes offline, the NameNode schedules re-replication of the under-replicated blocks that were lost with it, which increases network traffic and can degrade cluster performance.
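The tolerance threshold is set in hdfs-site.xml. A minimal sketch; the value 1 below is illustrative (the default is 0, meaning any volume failure takes the DataNode offline):

```xml
<!-- hdfs-site.xml: illustrative value; the default is 0 -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
  <description>Number of volumes that may fail before the DataNode shuts down.</description>
</property>
```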
Use the hdfs dfsadmin -report
command to identify the DataNodes that are offline:
root@netdata # hdfs dfsadmin -report
Check the output for nodes that are missing or listed under "Dead datanodes". If every node is reported as live, you will need to run the next command against each DataNode to inspect its volumes individually.
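As a sketch, the report output can be scanned programmatically to spot nodes that are missing entirely. This assumes the report prints one "Name: &lt;ip:port&gt; (&lt;hostname&gt;)" line per DataNode; the sample text and hostnames below are illustrative:

```python
import re

def datanodes_in_report(report_text):
    """Extract (ip:port, hostname) pairs from `hdfs dfsadmin -report` output."""
    # Assumes the report prints one "Name: <ip:port> (<hostname>)" line per node.
    return re.findall(r"^Name:\s+(\S+)\s+\((\S+)\)", report_text, flags=re.M)

def missing_datanodes(report_text, expected_hosts):
    """Return expected hostnames that do not appear in the report at all."""
    reported = {host for _, host in datanodes_in_report(report_text)}
    return sorted(set(expected_hosts) - reported)

# Illustrative sample; in practice, pass the real `hdfs dfsadmin -report` output.
SAMPLE_REPORT = """\
Live datanodes (1):

Name: 10.0.0.11:9866 (dn1.example.com)

Dead datanodes (1):

Name: 10.0.0.13:9866 (dn3.example.com)
"""

print(missing_datanodes(SAMPLE_REPORT,
                        ["dn1.example.com", "dn2.example.com", "dn3.example.com"]))
# → ['dn2.example.com']
```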
Use the hdfs dfsadmin -getVolumeReport
command, specifying the DataNode hostname and port:
root@netdata # hdfs dfsadmin -getVolumeReport datanodehost:port
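When every DataNode must be checked, the per-node invocations can be generated from your host list. A minimal sketch; the port 9867 is an assumption (substitute the port from your cluster's dfs.datanode.ipc.address setting):

```python
def volume_report_cmds(datanodes, port=9867):
    """Build one `hdfs dfsadmin -getVolumeReport` argv per DataNode.

    The default port here is illustrative; use your cluster's
    dfs.datanode.ipc.address port.
    """
    return [["hdfs", "dfsadmin", "-getVolumeReport", f"{host}:{port}"]
            for host in datanodes]

# Run them with subprocess, e.g.:
# for cmd in volume_report_cmds(["dn1.example.com", "dn2.example.com"]):
#     subprocess.run(cmd, check=False)
```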
Connect to the affected DataNode and check its logs with journalctl -xe
. If the Netdata Agent is running on your DataNodes, its disk and mount alerts (which you may already have received for this system) can help you pinpoint the failed volume.
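A quick way to triage the logs is to grep for common disk-failure markers. The patterns below are typical of Hadoop disk-error messages, but treat them as assumptions and adjust them for your Hadoop version; the sample log lines are illustrative:

```python
import re

# Assumed failure markers; adjust for your Hadoop version's log messages.
FAILURE_PATTERNS = [
    re.compile(r"DiskErrorException"),
    re.compile(r"[Ff]ailed volume"),
    re.compile(r"Input/output error"),
]

def volume_failure_lines(log_lines):
    """Return log lines matching any of the assumed disk-failure markers."""
    return [line for line in log_lines
            if any(p.search(line) for p in FAILURE_PATTERNS)]

# Illustrative sample; feed it the DataNode's journal or log file in practice.
SAMPLE_LOG = [
    "INFO datanode.DataNode: Receiving block blk_1073741825",
    "WARN checker.StorageLocationChecker: Exception checking /data/2: DiskErrorException",
    "INFO impl.FsDatasetImpl: Removing failed volume /data/2",
]
for line in volume_failure_lines(SAMPLE_LOG):
    print(line)
```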
Based on the information gathered in the previous steps, take appropriate actions to resolve the issue. This may include:
- Replacing or repairing the failed disk.
- Fixing misconfigurations, such as duplicate mounts.
- Adjusting the dfs.datanode.failed.volumes.tolerated setting, if temporarily tolerating a failed volume is acceptable.
- Restarting the affected DataNode once the underlying problem is fixed.
Note: When working with HDFS, it's essential to have proper backups of your data. Netdata is not responsible for any loss or corruption of data, database, or software.