Netdata Cloud On-Prem is an enterprise-grade monitoring solution that relies on several infrastructure components, including a Kubernetes cluster, PostgreSQL, and Apache Pulsar. These components should be monitored and managed according to your organization's established practices and requirements.
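A quick first health check is to confirm that all component pods are up. A minimal sketch, assuming the default `netdata-cloud` namespace used throughout this guide:

```bash
# All component pods should be Running with READY counts at full
kubectl get pods -n netdata-cloud -o wide
```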
If your installation fails with this error:

```
Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.
```
This error typically indicates insufficient cluster resources. Here's how to diagnose and resolve the issue.
> **Important**
>
> - For full installation: Ensure you're in the correct cluster context (see the context check below).
> - For Light PoC: SSH into the Ubuntu VM, which has `kubectl` pre-configured.
> - For Light PoC: Always perform a complete uninstallation before attempting a new installation.
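Before retrying a full installation, confirm that `kubectl` is pointed at the intended cluster. A minimal check using standard `kubectl` commands (the context name is a placeholder):

```bash
# Show the context kubectl is currently using
kubectl config current-context

# List all available contexts and switch if needed
kubectl config get-contexts
kubectl config use-context <YOUR_CLUSTER_CONTEXT>
```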
Check for pods stuck in the Pending state:

```bash
kubectl get pods -n netdata-cloud | grep -v Running
```
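If you prefer to filter server-side rather than with `grep`, the same check can be expressed with a standard `kubectl` field selector:

```bash
# List only pods whose phase is Pending
kubectl get pods -n netdata-cloud --field-selector=status.phase=Pending
```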
If you find Pending pods, examine the resource constraints:

```bash
kubectl describe pod <POD_NAME> -n netdata-cloud
```
Review the Events section at the bottom of the output. Look for messages about:
- Insufficient CPU
- Insufficient Memory
- Node capacity issues
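Instead of describing pods one at a time, you can also pull recent events for the whole namespace. A quick sketch using standard `kubectl` flags:

```bash
# Show the most recent events in the namespace, oldest first
kubectl get events -n netdata-cloud --sort-by=.lastTimestamp | tail -n 20
```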
View overall cluster resources:

```bash
# Check resource allocation across nodes
kubectl top nodes

# View detailed node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
```
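Once you've freed or added capacity, retrying with a longer Helm timeout can avoid the deadline error on slower clusters. This is a sketch only; the chart reference and values file are placeholders for your actual installation parameters:

```bash
# Retry the installation, giving Helm more time before it gives up
helm upgrade --install netdata-cloud-onprem <CHART_REFERENCE> \
  -n netdata-cloud \
  -f values.yaml \
  --timeout 15m \
  --wait
```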
Installation may complete successfully, but configuration mismatches can still prevent you from logging in. The table below provides a quick reference for troubleshooting common login issues after installation.
| Issue | Symptoms | Cause | Solution |
|---|---|---|---|
| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs<br>- Expired/invalid SSO tokens<br>- Untrusted certificates<br>- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`<br>- Verify certificates are valid and trusted<br>- Ensure FQDN matches certificate |
| MailCatcher Login (Light PoC) | - Magic links not arriving<br>- "Invalid token" errors | - Incorrect hostname during installation<br>- Modified default MailCatcher values | - Reinstall with correct FQDN<br>- Restore default MailCatcher settings<br>- Ensure hostname matches certificate |
| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration<br>- Network connectivity issues | - Update SMTP settings in `values.yaml`<br>- Verify network allows SMTP traffic<br>- Check mail server logs |
| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret<br>- Database hash mismatch<br>- Namespace change without secret migration | - Migrate secret before namespace change<br>- Perform fresh installation<br>- Contact support for data recovery |
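For the Light PoC, you can confirm whether magic-link emails are reaching MailCatcher by port-forwarding its web UI (which listens on port 1080 by default). The service name below is an assumption based on a default installation; adjust it to match your deployment:

```bash
# Forward MailCatcher's web UI to localhost, then open http://localhost:1080
kubectl port-forward -n netdata-cloud svc/mailcatcher 1080:1080
```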
> **Warning**
>
> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated. Before proceeding, back up the existing `netdata-cloud-common` secret (a backup sketch follows), or alternatively wipe the PostgreSQL database to prevent data conflicts.
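A minimal way to back up the secret before changing namespaces, using standard `kubectl` (assuming the current installation lives in `netdata-cloud`):

```bash
# Save the secret manifest so it can be restored in the new namespace
kubectl get secret netdata-cloud-common -n netdata-cloud -o yaml > netdata-cloud-common.backup.yaml
```

When restoring into a different namespace, update the `metadata.namespace` field (and strip server-generated fields such as `resourceVersion` and `uid`) before applying the manifest.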
When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The charts
service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.
| Issue | Symptoms | Cause | Solution |
|---|---|---|---|
| Agent Connectivity | - Queries stall or time out<br>- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional Parent nodes to provide reliable backends. The system will automatically prefer these for queries when available |
| Kubernetes Resources | - Service throttling<br>- Slow data processing<br>- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed |
| Database Performance | - Slow query responses<br>- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:<br>- CPU usage<br>- Memory allocation<br>- Disk I/O performance |
| Message Broker | - Delayed node status updates (online/offline/stale)<br>- Slow alert transitions<br>- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration<br>- Adjust microservice resource allocation<br>- Monitor message processing rates |
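To check whether messages are accumulating in Pulsar, you can inspect topic backlogs with `pulsar-admin` from inside a broker pod. The pod name and the tenant/namespace below are assumptions; adjust them to your deployment:

```bash
# List topics, then inspect backlog statistics for a suspect topic
kubectl exec -n netdata-cloud <PULSAR_BROKER_POD> -- bin/pulsar-admin topics list <TENANT/NAMESPACE>
kubectl exec -n netdata-cloud <PULSAR_BROKER_POD> -- bin/pulsar-admin topics stats <TOPIC_NAME>
```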
If issues persist:

1. Gather diagnostic information from your cluster (see the sketch below).
2. Contact support at support@netdata.cloud.
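A minimal sketch for collecting the outputs support is most likely to need, using only standard `kubectl` commands (the `netdata-cloud` namespace is assumed):

```bash
#!/usr/bin/env bash
# Collect basic diagnostics for a Netdata Cloud On-Prem installation
set -euo pipefail

NS=netdata-cloud
OUT=netdata-cloud-diagnostics
mkdir -p "$OUT"

# Pod, event, and node state
kubectl get pods -n "$NS" -o wide > "$OUT/pods.txt"
kubectl get events -n "$NS" --sort-by=.lastTimestamp > "$OUT/events.txt"
kubectl describe nodes > "$OUT/nodes.txt"

# Recent logs from every pod in the namespace
for pod in $(kubectl get pods -n "$NS" -o name); do
  kubectl logs -n "$NS" "$pod" --all-containers --tail=500 \
    > "$OUT/$(basename "$pod").log" 2>&1 || true
done

tar czf "$OUT.tar.gz" "$OUT"
echo "Diagnostics written to $OUT.tar.gz"
```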