# Kubernetes monitoring with Netdata

This document gives an overview of the visualizations Netdata provides for Kubernetes deployments.

At Netdata, we've built Kubernetes monitoring tools that add visibility without complexity while also helping you
actively troubleshoot anomalies or outages. This guide walks you through each of the visualizations and offers best
practices on how to use them to start Kubernetes monitoring in a matter of minutes, not hours or days.

Netdata's Kubernetes monitoring solution uses a handful of [complementary tools and
collectors](#related-reference-documentation) for peeling back the many complex layers of a Kubernetes cluster,
_entirely for free_. These methods work together to give you the metrics you need to troubleshoot performance or
availability issues across your Kubernetes infrastructure.

## Challenge

While Kubernetes (k8s) might simplify the way you deploy, scale, and load-balance your applications, not all clusters
come with "batteries included" when it comes to monitoring. Doubly so for a monitoring stack that helps you actively
troubleshoot issues with your cluster.

Some k8s providers, like GKE (Google Kubernetes Engine), do deploy clusters bundled with monitoring capabilities, such
as Google Cloud Monitoring (formerly Stackdriver). However, these pre-configured solutions might not offer the depth of
metrics, customization, or integration with your preferred alerting methods.

Adding that visibility after the fact is like building an entire house and _then_ smashing your way through the
finished walls to add windows.

## Solution

In this tutorial, you'll learn how to navigate Netdata's Kubernetes monitoring features, using
[robot-shop](https://github.com/instana/robot-shop) as an example deployment. Deploying robot-shop is purely optional.
You can also follow along with your own Kubernetes deployment if you choose. While the metrics might be different, the
navigation and best practices are the same for every cluster.

## What you need to get started

To follow this tutorial, you need:

- A free Netdata Cloud account. [Sign up](https://app.netdata.cloud/sign-up?cloudRoute=/spaces) if you don't have one
  already.
- A working cluster running Kubernetes v1.9 or newer, with a Netdata deployment and connected parent/child nodes. See
  our [Kubernetes deployment process](/packaging/installer/methods/kubernetes.md) for details on deployment and
  connecting to Cloud.
- The [`kubectl`](https://kubernetes.io/docs/reference/kubectl/overview/) command line tool, within [one minor version
  difference](https://kubernetes.io/docs/tasks/tools/install-kubectl/#before-you-begin) of your cluster, on an
  administrative system.
- The [Helm package manager](https://helm.sh/) v3.0.0 or newer on the same administrative system.

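
If you want to verify the `kubectl` requirement before you start, the "within one minor version" rule can be sketched
in a few lines of shell. This is a minimal sketch and not part of Netdata: the helper names `minor_of` and `skew_ok`
are made up for illustration, the comparison assumes both versions share the same major version, and the usage comment
assumes `jq` is installed.

```bash
# Extract the minor version from a string such as "v1.27.3" or "1.27.3-eks".
minor_of() {
  echo "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# Succeed (exit 0) when the client and server minor versions differ by at most one.
# Hypothetical helper; assumes both versions have the same major version.
skew_ok() {
  client=$(minor_of "$1")
  server=$(minor_of "$2")
  diff=$((client - server))
  [ "${diff#-}" -le 1 ]   # ${diff#-} strips a leading minus sign
}

# Possible usage against a live cluster (assumes jq is available):
#   versions=$(kubectl version -o json)
#   skew_ok "$(echo "$versions" | jq -r .clientVersion.gitVersion)" \
#           "$(echo "$versions" | jq -r .serverVersion.gitVersion)" \
#     && echo "kubectl is within one minor version of the cluster"
```

The functions only compare version strings, so you can try them without a cluster, for example
`skew_ok v1.27.3 v1.28.1 && echo ok`.
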
### Install the `robot-shop` demo (optional)

Begin by downloading the robot-shop code and using `helm` to create a new deployment.

```bash
# Clone over HTTPS so no GitHub SSH key is required
git clone https://github.com/instana/robot-shop.git
cd robot-shop/K8s/helm
kubectl create ns robot-shop
helm install robot-shop --namespace robot-shop .
```

Running `kubectl get pods --all-namespaces` shows both the Netdata and robot-shop deployments.

```bash
kubectl get pods --all-namespaces
NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE
default       netdata-child-29f9c               2/2     Running   0          10m
default       netdata-child-8xphf               2/2     Running   0          10m
default       netdata-child-jdvds               2/2     Running   0          11m
default       netdata-parent-554c755b7d-qzrx4   1/1     Running   0          11m
kube-system   aws-node-jnjv8                    1/1     Running   0          17m
kube-system   aws-node-svzdb                    1/1     Running   0          17m
kube-system   aws-node-ts6n2                    1/1     Running   0          17m
kube-system   coredns-559b5db75d-f58hp          1/1     Running   0          22h
kube-system   coredns-559b5db75d-tkzj2          1/1     Running   0          22h
kube-system   kube-proxy-9p9cd                  1/1     Running   0          17m
kube-system   kube-proxy-lt9ss                  1/1     Running   0          17m
kube-system   kube-proxy-n75t9                  1/1     Running   0          17m
robot-shop    cart-b4bbc8fff-t57js              1/1     Running   0          14m
robot-shop    catalogue-8b5f66c98-mr85z         1/1     Running   0          14m
robot-shop    dispatch-67d955c7d8-lnr44         1/1     Running   0          14m
robot-shop    mongodb-7f65d86c-dsslc            1/1     Running   0          14m
robot-shop    mysql-764c4c5fc7-kkbnf            1/1     Running   0          14m
robot-shop    payment-67c87cb7d-5krxv           1/1     Running   0          14m
robot-shop    rabbitmq-5bb66bb6c9-6xr5b         1/1     Running   0          14m
robot-shop    ratings-94fd9c75b-42wvh           1/1     Running   0          14m
robot-shop    redis-0                           0/1     Pending   0          14m
robot-shop    shipping-7d69cb88b-w7hpj          1/1     Running   0          14m
robot-shop    user-79c445b44b-hwnm9             1/1     Running   0          14m
robot-shop    web-8bb887476-lkcjx               1/1     Running   0          14m
```
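
In the listing above, `redis-0` is stuck in `Pending`. On a busier cluster, a small filter makes pods like that easier
to spot. The sketch below assumes the default column order shown above (`NAMESPACE NAME READY STATUS RESTARTS AGE`);
`pods_not_running` is a hypothetical helper name, not a `kubectl` or Netdata feature.

```bash
# Read `kubectl get pods --all-namespaces` output on stdin and print
# "namespace/pod status" for every pod whose STATUS column is not "Running".
# NR > 1 skips the header row. Note that finished pods (status "Completed")
# are reported too, which may or may not be what you want.
pods_not_running() {
  awk 'NR > 1 && $4 != "Running" { print $1 "/" $2, $4 }'
}

# Possible usage against a live cluster:
#   kubectl get pods --all-namespaces | pods_not_running
#
# kubectl can also filter on the pod phase itself:
#   kubectl get pods --all-namespaces --field-selector='status.phase!=Running'
```

Run against the listing above, the filter would print a single line for `robot-shop/redis-0`.
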
## Explore Netdata's Kubernetes monitoring charts

The Netdata Helm chart deploys and enables everything you need for monitoring Kubernetes on every layer. Once you deploy
Netdata and connect your cluster's nodes, you're ready to check out the visualizations **with zero configuration**.

To get started, [sign in](https://app.netdata.cloud/sign-in?cloudRoute=/spaces) to your Netdata Cloud account. If you
connected your cluster to a Room other than **General**, head over to that Room.

Let's walk through monitoring each layer of a Kubernetes cluster using the Overview as our framework.

## Cluster and node metrics

The gauges and time-series charts you see right away in the Overview show aggregated metrics from every node in your
cluster.

For example, the `apps.cpu` chart (in the **Applications** menu item) visualizes the CPU utilization of various
applications/services running on each of the nodes in your cluster. The **X Nodes** dropdown shows which nodes
contribute to the chart and links to jump to a single-node dashboard for further investigation.

![Per-application monitoring in a Kubernetes cluster](https://user-images.githubusercontent.com/1153921/109042169-19c8fa00-768d-11eb-91a7-1a7afc41fea2.png)

The chart above shows a spike in the CPU utilization from `rabbitmq` every minute or so, along with a
baseline CPU utilization of 10-15% across the cluster.

## Pod and container metrics

Click on the **Kubernetes xxxxxxx...** section to jump down to Netdata Cloud's unique Kubernetes visualizations for
viewing real-time resource utilization metrics from your Kubernetes pods and containers.

![Navigating to the Kubernetes monitoring visualizations](https://user-images.githubusercontent.com/1153921/109049195-349f6c80-7695-11eb-8902-52a029dca77f.png)

### Health map

The first visualization is the [health map](/docs/dashboards-and-charts/kubernetes-tab.md#health-map),
which places each container into its own box, then varies the intensity of their color to visualize the resource
utilization. By default, the health map shows the **average CPU utilization as a percentage of the configured limit**
for every container in your cluster.

![The Kubernetes health map in Netdata Cloud](https://user-images.githubusercontent.com/1153921/109050085-3f0e3600-7696-11eb-988f-52cb187f53ea.png)

Let's explore the most colorful box by hovering over it.

![Hovering over a container](https://user-images.githubusercontent.com/1153921/109049544-a8417980-7695-11eb-80a7-109b4a645a27.png)

The **Context** tab shows `rabbitmq-5bb66bb6c9-6xr5b` as the container's image name, which means this container is
running a [RabbitMQ](/src/go/plugin/go.d/collector/rabbitmq/README.md) workload.

Click the **Metrics** tab to see real-time metrics from that container. Unsurprisingly, it shows a spike in CPU
utilization at regular intervals.

![Viewing real-time container metrics](https://user-images.githubusercontent.com/1153921/109050482-aa580800-7696-11eb-9e3e-d3bdf0f3eff7.png)
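
The health map's default metric, CPU utilization as a percentage of the configured limit, is simple arithmetic once
both values are in the same unit. The sketch below illustrates that calculation only; it is not how Netdata computes
the value internally. The helper names are hypothetical, and it handles just the two most common Kubernetes CPU
notations: millicores (`250m`) and whole or fractional cores (`2`, `0.5`).

```bash
# Convert a Kubernetes CPU quantity ("250m", "0.5", "2") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;                                          # already millicores
    *)  awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 1000 }' ;;   # cores -> millicores
  esac
}

# Print CPU usage as a rounded percentage of the configured limit.
# Arguments: <usage> <limit>, both as Kubernetes CPU quantities.
# Returns 1 when the limit is zero, since the ratio is undefined.
cpu_percent_of_limit() {
  used=$(to_millicores "$1")
  limit=$(to_millicores "$2")
  [ "$limit" -gt 0 ] || return 1
  awk -v u="$used" -v l="$limit" 'BEGIN { printf "%.0f\n", u * 100 / l }'
}
```

For example, a container using `125m` against a limit of `500m` is at 25% of its limit:
`cpu_percent_of_limit 125m 500m`. If your cluster runs metrics-server, `kubectl top pod --containers` is one way to
obtain usage figures in millicores to feed into a calculation like this.
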
### Time-series charts

Beneath the health map is a variety of time-series charts that help you visualize resource utilization over time, which
is useful for targeted troubleshooting.

The default is to display metrics grouped by the `k8s_namespace` label, which shows resource utilization based on your
different namespaces.

![Time-series Kubernetes monitoring in Netdata Cloud](https://user-images.githubusercontent.com/1153921/109075210-126a1680-76b6-11eb-918d-5acdcdac152d.png)

Each composite chart has a [definition bar](/docs/dashboards-and-charts/netdata-charts.md#definition-bar)
for complete customization. For example, grouping the top chart by `k8s_container_name` reveals new information.

![Changing time-series charts](https://user-images.githubusercontent.com/1153921/109075212-139b4380-76b6-11eb-836f-939482ae55fc.png)
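
To build intuition for what switching the grouping label does, here is a minimal sketch of a group-by over labeled
samples. It is an illustration only, not Netdata's implementation: the input format (one `namespace container value`
triple per line) and the `sum_by_label` helper are assumptions made for this example.

```bash
# Sum a metric by one label.
# Input lines on stdin: "<k8s_namespace> <k8s_container_name> <value>".
# Argument: the column to group by (1 = namespace, 2 = container name).
# Output: one "label total" line per distinct label, sorted by label.
sum_by_label() {
  awk -v col="$1" '
    { total[$col] += $3 }
    END { for (label in total) print label, total[label] }
  ' | sort
}

# Possible usage with a file of samples (hypothetical file name):
#   sum_by_label 1 < samples.txt   # one line per namespace
#   sum_by_label 2 < samples.txt   # one line per container
```

Grouping by namespace collapses every container in `robot-shop` into a single series, while grouping by container
name splits them apart again, which is why the second view can reveal which container is behind a spike.
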
## Service metrics

Netdata has a [service discovery plugin](https://github.com/netdata/agent-service-discovery), which discovers and
creates configuration files for [compatible
services](https://github.com/netdata/helmchart#service-discovery-and-supported-services) and any endpoints covered by
our [generic Prometheus collector](/src/go/plugin/go.d/collector/prometheus/README.md).
Netdata uses these files to collect metrics from any compatible application as it runs _inside_ a pod. Service
discovery happens without manual intervention as pods are created, destroyed, or moved between nodes.

Service metrics show up on the Overview as well, beneath the **Kubernetes** section, and are labeled according to the
service in question. For example, the **RabbitMQ** section has numerous charts from the [`rabbitmq`
collector](/src/go/plugin/go.d/collector/rabbitmq/README.md):

![Finding service discovery metrics](https://user-images.githubusercontent.com/1153921/109054511-2eac8a00-769b-11eb-97f1-da93acb4b5fe.png)

> The robot-shop cluster has more supported services, such as MySQL, which are not visible with zero configuration. This
> is usually because the services run on non-default ports, use non-default names, or require passwords. Read up
> on [configuring service discovery](/packaging/installer/methods/kubernetes.md#configure-service-discovery) to collect
> more service metrics.

Service metrics are essential to infrastructure monitoring, as they're the best indicator of the end-user experience,
and key signals for troubleshooting anomalies or issues.

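
When a service is missing from the Overview, a non-default port is one of the first things worth ruling out. The
sketch below compares a port against the upstream default for a few of the services in the robot-shop demo. Treat it
as an illustration under assumptions: the helper names are hypothetical, and the table lists each project's own
default port, which is not a statement about the exact matching rules Netdata's service discovery applies.

```bash
# Upstream default ports for a few services in the robot-shop demo.
default_port() {
  case "$1" in
    mysql)    echo 3306 ;;
    redis)    echo 6379 ;;
    mongodb)  echo 27017 ;;
    rabbitmq) echo 15672 ;;   # management API; AMQP itself listens on 5672
    *)        return 1 ;;
  esac
}

# Report whether a service listens on its upstream default port.
# Arguments: <service-name> <port>
# Exit status: 0 = default port, 1 = non-default port, 2 = unknown service.
check_port() {
  expected=$(default_port "$1") || { echo "$1: unknown service"; return 2; }
  if [ "$2" = "$expected" ]; then
    echo "$1: default port $2"
  else
    echo "$1: non-default port $2 (default is $expected)"
    return 1
  fi
}
```

For example, `check_port mysql 3307` reports a non-default port, which would be a hint to look at the service
discovery configuration linked in the note above.
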
## Kubernetes components

Netdata also automatically collects metrics from two essential Kubernetes processes.

### kubelet

The **k8s kubelet** section visualizes metrics from the Kubernetes agent responsible for managing every pod on a given
node. This also happens without any configuration thanks to the [kubelet
collector](/src/go/plugin/go.d/collector/k8s_kubelet/README.md).

Monitoring each node's kubelet can be invaluable when diagnosing issues with your Kubernetes cluster. For example, you
can see if the number of running containers/pods has dropped, which could signal a fault or crash in a particular
Kubernetes service or deployment (see `kubectl get services` or `kubectl get deployments` for more details). If the
number of pods increases, it may be because of something more benign, like another team member scaling up a
service with `kubectl scale`.

You can also view charts for the Kubelet API server, the volume of runtime/Docker operations by type,
configuration-related errors, and the actual vs. desired numbers of volumes, plus a lot more.
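
If the kubelet charts show a drop in running pods, `kubectl get deployments` is a quick way to find which deployment
is short of replicas. The sketch below filters that output for you. It assumes the column order printed with
`--all-namespaces` (`NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE`), where `READY` has the form `ready/desired`;
`deployments_not_ready` is a hypothetical helper name.

```bash
# Read `kubectl get deployments --all-namespaces` output on stdin and print
# "namespace/deployment ready/desired" for every deployment that has fewer
# ready replicas than desired. NR > 1 skips the header row.
deployments_not_ready() {
  awk 'NR > 1 {
    split($3, replicas, "/")            # "0/2" -> replicas[1]=0, replicas[2]=2
    if (replicas[1] + 0 < replicas[2] + 0) print $1 "/" $2, $3
  }'
}

# Possible usage against a live cluster:
#   kubectl get deployments --all-namespaces | deployments_not_ready
```

An empty result means every deployment has all of its desired replicas ready, so the drop is more likely to come from
a scale-down or from workloads that are not deployments, such as the `redis-0` StatefulSet pod in the demo.
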
### kube-proxy

The **k8s kube-proxy** section displays metrics about the network proxy that runs on each node in your Kubernetes
cluster. kube-proxy lets pods communicate with each other and accept sessions from outside your cluster. Its metrics are
collected by the [kube-proxy
collector](/src/go/plugin/go.d/collector/k8s_kubeproxy/README.md).

With Netdata, you can monitor how often your k8s proxies are syncing proxy rules between nodes. Dramatic changes in
these figures could indicate an anomaly in your cluster that's worthy of further investigation.
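
To see what "how often" means in raw numbers, you can turn two readings of a sync counter into a per-second rate. The
sketch below works on Prometheus text-format metrics, which is the format kube-proxy exposes. The metric name
`kubeproxy_sync_proxy_rules_duration_seconds_count` and the `127.0.0.1:10249` endpoint in the usage comment are
assumptions based on kube-proxy's usual defaults, so check them against your own cluster. The helper names are
hypothetical.

```bash
# Print the value of one sample without labels from Prometheus text-format
# metrics on stdin. Argument: the exact metric name.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Per-second rate between two counter readings.
# Arguments: <first reading> <second reading> <seconds between readings>
counter_rate() {
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.2f\n", (b - a) / t }'
}

# Possible usage on a node (endpoint and metric name are assumptions):
#   m=kubeproxy_sync_proxy_rules_duration_seconds_count
#   first=$(curl -s http://127.0.0.1:10249/metrics | metric_value "$m")
#   sleep 60
#   second=$(curl -s http://127.0.0.1:10249/metrics | metric_value "$m")
#   counter_rate "$first" "$second" 60
```

A counter that moves from 120 to 150 over 60 seconds works out to 0.50 syncs per second. The Netdata charts do this
kind of bookkeeping for you continuously, which is what makes a sudden change stand out.
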
## What's next?

After reading this guide, you should now be able to monitor any Kubernetes cluster with Netdata, including nodes, pods,
containers, services, and more.

With the health map, time-series charts, and the ability to drill down into individual nodes, you can see hundreds of
per-second metrics with zero configuration and spend less time remembering all the `kubectl` options. Netdata moves with
your cluster, automatically picking up new nodes or services as your infrastructure scales. And it's entirely free for
clusters of all sizes.

### Related reference documentation

- [Netdata Helm chart](https://github.com/netdata/helmchart)
- [Netdata service discovery](https://github.com/netdata/agent-service-discovery)
- [Netdata Agent · `kubelet` collector](/src/go/plugin/go.d/collector/k8s_kubelet/README.md)
- [Netdata Agent · `kube-proxy` collector](/src/go/plugin/go.d/collector/k8s_kubeproxy/README.md)
- [Netdata Agent · `cgroups.plugin`](/src/collectors/cgroups.plugin/README.md)