Netdata Cloud On-Prem is an enterprise-grade monitoring solution that relies on several infrastructure components, including a Kubernetes cluster, PostgreSQL, and Apache Pulsar. These components should be monitored and managed according to your organization's established practices and requirements.
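A quick first health check is to confirm that all component pods are up. A minimal sketch, assuming the default `netdata-cloud` namespace used throughout this guide:

```bash
# All component pods should be Running with READY counts at full
kubectl get pods -n netdata-cloud -o wide
```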
If your installation fails with this error:

```
Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.
```
This error typically indicates insufficient cluster resources. Here's how to diagnose and resolve the issue.
> **Important**
>
> - For full installation: Ensure you're in the correct cluster context (see the context check below).
> - For Light PoC: SSH into the Ubuntu VM, which has `kubectl` pre-configured.
> - For Light PoC: Always perform a complete uninstallation before attempting a new installation.
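Before retrying a full installation, confirm that `kubectl` is pointed at the intended cluster. A minimal check using standard `kubectl` commands (the context name is a placeholder):

```bash
# Show the context kubectl is currently using
kubectl config current-context

# List all available contexts and switch if needed
kubectl config get-contexts
kubectl config use-context <YOUR_CLUSTER_CONTEXT>
```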
Check for pods stuck in the Pending state:

```bash
kubectl get pods -n netdata-cloud | grep -v Running
```
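If you prefer to filter server-side rather than with `grep`, the same check can be expressed with a standard `kubectl` field selector:

```bash
# List only pods whose phase is Pending
kubectl get pods -n netdata-cloud --field-selector=status.phase=Pending
```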
If you find Pending pods, examine the resource constraints:

```bash
kubectl describe pod <POD_NAME> -n netdata-cloud
```
Review the Events section at the bottom of the output. Look for messages about:
- Insufficient CPU
- Insufficient Memory
- Node capacity issues
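Instead of describing pods one at a time, you can also pull recent events for the whole namespace. A quick sketch using standard `kubectl` flags:

```bash
# Show the most recent events in the namespace, oldest first
kubectl get events -n netdata-cloud --sort-by=.lastTimestamp | tail -n 20
```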
View overall cluster resources:

```bash
# Check resource allocation across nodes
kubectl top nodes

# View detailed node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
```
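Once you've freed or added capacity, retrying with a longer Helm timeout can avoid the deadline error on slower clusters. This is a sketch only; the chart reference and values file are placeholders for your actual installation parameters:

```bash
# Retry the installation, giving Helm more time before it gives up
helm upgrade --install netdata-cloud-onprem <CHART_REFERENCE> \
  -n netdata-cloud \
  -f values.yaml \
  --timeout 15m \
  --wait
```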
Installation may complete successfully, but configuration mismatches can still prevent you from logging in. The table below provides a quick reference for troubleshooting common login issues after installation.
| Issue | Symptoms | Cause | Solution |
|---|---|---|---|
| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs<br>- Expired/invalid SSO tokens<br>- Untrusted certificates<br>- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`<br>- Verify certificates are valid and trusted<br>- Ensure FQDN matches certificate |
| MailCatcher Login (Light PoC) | - Magic links not arriving<br>- "Invalid token" errors | - Incorrect hostname during installation<br>- Modified default MailCatcher values | - Reinstall with correct FQDN<br>- Restore default MailCatcher settings<br>- Ensure hostname matches certificate |
| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration<br>- Network connectivity issues | - Update SMTP settings in `values.yaml`<br>- Verify network allows SMTP traffic<br>- Check mail server logs |
| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret<br>- Database hash mismatch<br>- Namespace change without secret migration | - Migrate secret before namespace change<br>- Perform fresh installation<br>- Contact support for data recovery |
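For the Light PoC, you can confirm whether magic-link emails are reaching MailCatcher by port-forwarding its web UI (which listens on port 1080 by default). The service name below is an assumption based on a default installation; adjust it to match your deployment:

```bash
# Forward MailCatcher's web UI to localhost, then open http://localhost:1080
kubectl port-forward -n netdata-cloud svc/mailcatcher 1080:1080
```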
> **Warning**
>
> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated. Before proceeding, back up the existing `netdata-cloud-common` secret (a backup sketch follows), or alternatively wipe the PostgreSQL database to prevent data conflicts.
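A minimal way to back up the secret before changing namespaces, using standard `kubectl` (assuming the current installation lives in `netdata-cloud`):

```bash
# Save the secret manifest so it can be restored in the new namespace
kubectl get secret netdata-cloud-common -n netdata-cloud -o yaml > netdata-cloud-common.backup.yaml
```

When restoring into a different namespace, update the `metadata.namespace` field (and strip server-generated fields such as `resourceVersion` and `uid`) before applying the manifest.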
When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The charts
service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.
| Issue | Symptoms | Cause | Solution |
|---|---|---|---|
| Agent Connectivity | - Queries stall or time out<br>- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional Parent nodes to provide reliable backends. The system will automatically prefer these for queries when available |
| Kubernetes Resources | - Service throttling<br>- Slow data processing<br>- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed |
| Database Performance | - Slow query responses<br>- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:<br>- CPU usage<br>- Memory allocation<br>- Disk I/O performance |
| Message Broker | - Delayed node status updates (online/offline/stale)<br>- Slow alert transitions<br>- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration<br>- Adjust microservice resource allocation<br>- Monitor message processing rates |
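To check whether messages are accumulating in Pulsar, you can inspect topic backlogs with `pulsar-admin` from inside a broker pod. The pod name and the tenant/namespace below are assumptions; adjust them to your deployment:

```bash
# List topics, then inspect backlog statistics for a suspect topic
kubectl exec -n netdata-cloud <PULSAR_BROKER_POD> -- bin/pulsar-admin topics list <TENANT/NAMESPACE>
kubectl exec -n netdata-cloud <PULSAR_BROKER_POD> -- bin/pulsar-admin topics stats <TOPIC_NAME>
```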
If issues persist:

1. Gather diagnostic information from your cluster (see the sketch below).
2. Contact support at support@netdata.cloud.
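A minimal sketch for collecting the outputs support is most likely to need, using only standard `kubectl` commands (the `netdata-cloud` namespace is assumed):

```bash
#!/usr/bin/env bash
# Collect basic diagnostics for a Netdata Cloud On-Prem installation
set -euo pipefail

NS=netdata-cloud
OUT=netdata-cloud-diagnostics
mkdir -p "$OUT"

# Pod, event, and node state
kubectl get pods -n "$NS" -o wide > "$OUT/pods.txt"
kubectl get events -n "$NS" --sort-by=.lastTimestamp > "$OUT/events.txt"
kubectl describe nodes > "$OUT/nodes.txt"

# Recent logs from every pod in the namespace
for pod in $(kubectl get pods -n "$NS" -o name); do
  kubectl logs -n "$NS" "$pod" --all-containers --tail=500 \
    > "$OUT/$(basename "$pod").log" 2>&1 || true
done

tar czf "$OUT.tar.gz" "$OUT"
echo "Diagnostics written to $OUT.tar.gz"
```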