
Nvidia GPU collector

Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, and more) using the nvidia-smi CLI tool.

Requirements and Notes

  • You must have the nvidia-smi tool installed and your NVIDIA GPU(s) must support it. The tool is mostly supported by the newer high-end models used for AI/ML and crypto, or by the Pro range; read more about nvidia-smi.
  • You must enable this plugin, as it is disabled by default due to minor performance issues:

    cd /etc/netdata   # Replace this path with your Netdata config directory, if different
    sudo ./edit-config python.d.conf
    

    Remove the '#' before nvidia_smi so it reads: nvidia_smi: yes.

  • On some systems, when the GPU is idle, the nvidia-smi tool unloads, which adds latency the next time it is queried. If your GPUs run under a constant workload, this is unlikely to be an issue.

  • Currently the nvidia-smi tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See the discussion here: https://github.com/netdata/netdata/pull/4357

  • Contributions are welcome.

  • Make sure the netdata user can execute /usr/bin/nvidia-smi, or wherever your binary is installed (a quick check is shown after this list).

  • If the nvidia-smi process is not killed after a Netdata restart, set loop_mode to no.

  • poll_seconds is an integer that sets how often, in seconds, the tool is polled.
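
A quick way to verify the binary path and permissions is to run the tool once as the netdata user. This is a minimal check, assuming the default path /usr/bin/nvidia-smi; adjust the path if yours differs:

sudo -u netdata /usr/bin/nvidia-smi

If this prints the usual GPU status table, the collector should be able to query the tool as well.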

Charts

It produces the following charts:

  • PCI Express Bandwidth Utilization in KiB/s
  • Fan Speed in percentage
  • GPU Utilization in percentage
  • Memory Bandwidth Utilization in percentage
  • Encoder/Decoder Utilization in percentage
  • Memory Usage in MiB
  • Temperature in Celsius
  • Clock Frequencies in MHz
  • Power Utilization in Watts
  • Memory Used by Each Process in MiB
  • Memory Used by Each User in MiB
  • Number of Users on GPU

Configuration

Edit the python.d/nvidia_smi.conf configuration file using edit-config from the Netdata config directory, which is typically at /etc/netdata.

cd /etc/netdata   # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d/nvidia_smi.conf

Sample:

loop_mode    : yes
poll_seconds : 1
exclude_zero_memory_users : yes
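
After saving the configuration, restart the Netdata Agent so the change takes effect. On most systemd-based installs this is:

sudo systemctl restart netdata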

Troubleshooting

To troubleshoot issues with the nvidia_smi module, run the python.d.plugin with the debug option enabled. The output will show the results of the data collection job, or error messages explaining why the collector isn't working.

First, navigate to your plugins directory; it is usually located at /usr/libexec/netdata/plugins.d/. If that's not the case on your system, open netdata.conf and look for the plugins directory setting. Once you're in the plugins directory, switch to the netdata user.

cd /usr/libexec/netdata/plugins.d/
sudo su -s /bin/bash netdata

Now you can manually run the nvidia_smi module in debug mode:

./python.d.plugin nvidia_smi debug trace
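
The debug output can be verbose. If you only want to spot failures, one way (just an example filter, not part of the plugin) is:

./python.d.plugin nvidia_smi debug trace 2>&1 | grep -i error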

Docker

GPU monitoring in a Docker container is possible with the nvidia-container-toolkit installed on the host system and gcompat added to the NETDATA_EXTRA_APK_PACKAGES environment variable.

Sample docker-compose.yml

version: '3'
services:
  netdata:
    image: netdata/netdata
    container_name: netdata
    hostname: example.com # set to fqdn of host
    ports:
      - 19999:19999
    restart: unless-stopped
    cap_add:
      - SYS_PTRACE
    security_opt:
      - apparmor:unconfined
    environment:
      - NETDATA_EXTRA_APK_PACKAGES=gcompat
    volumes:
      - netdataconfig:/etc/netdata
      - netdatalib:/var/lib/netdata
      - netdatacache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

volumes:
  netdataconfig:
  netdatalib:
  netdatacache:

Sample docker run

docker run -d --name=netdata \
  -p 19999:19999 \
  -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
  -v netdataconfig:/etc/netdata \
  -v netdatalib:/var/lib/netdata \
  -v netdatacache:/var/cache/netdata \
  -v /etc/passwd:/host/etc/passwd:ro \
  -v /etc/group:/host/etc/group:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /etc/os-release:/host/etc/os-release:ro \
  --restart unless-stopped \
  --cap-add SYS_PTRACE \
  --security-opt apparmor=unconfined \
  --gpus all \
  netdata/netdata

Docker Troubleshooting

To troubleshoot nvidia-smi in a Docker container, first confirm that nvidia-smi is working on the host system. If that is working correctly, run docker exec -it netdata nvidia-smi to confirm it's working within the Docker container. If nvidia-smi is functioning both inside and outside of the container, confirm that nvidia_smi: yes is uncommented in python.d.conf.
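
For example, assuming the container is named netdata as in the samples above:

nvidia-smi                            # on the host
docker exec -it netdata nvidia-smi    # inside the container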

docker exec -it netdata bash
cd /etc/netdata
./edit-config python.d.conf
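
After uncommenting nvidia_smi: yes, restart the container (named netdata in the samples above) so the change takes effect:

docker restart netdata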