<!--
title: "Monitor, troubleshoot, and debug applications with eBPF metrics"
description: "Use Netdata's built-in eBPF metrics collector to monitor, troubleshoot, and debug your custom application using low-level kernel feedback."
image: /img/seo/guides/troubleshoot/monitor-debug-applications-ebpf.png
-->

# Monitor, troubleshoot, and debug applications with eBPF metrics

When trying to troubleshoot or debug a finicky application, there's no such thing as too much information. At Netdata,
we developed programs that connect to the [_extended Berkeley Packet Filter_ (eBPF) virtual
machine](/collectors/ebpf.plugin/README.md) to help you see exactly how specific applications are interacting with the
Linux kernel. With these charts, you can root out bugs, discover optimizations, diagnose memory leaks, and much more.

This means you can see exactly how often, and in what volume, the application creates processes, opens files, writes to
the filesystem using virtual filesystem (VFS) functions, and much more. Even better, the eBPF collector gathers metrics
at an _event frequency_, which is even faster than Netdata's beloved 1-second granularity. When you troubleshoot and
debug applications with eBPF, rest assured that you won't miss even the smallest meaningful event.

Using this guide, you'll learn the fundamentals of setting up Netdata to give you kernel-level metrics from your
application so that you can monitor, troubleshoot, and debug to your heart's content.
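
Before diving in, it's worth confirming that your node can actually run the eBPF collector. As a quick sanity check
(the kernel config path below is typical for Debian- and Ubuntu-family systems and may differ on yours):

```bash
# Show the running kernel release; eBPF collection needs a reasonably modern kernel.
uname -r

# On many distributions the kernel build config lives here; CONFIG_BPF=y and
# CONFIG_BPF_SYSCALL=y indicate the kernel was built with BPF support.
grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=' /boot/config-$(uname -r)
```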

## Configure `apps.plugin` to recognize your custom application

To start troubleshooting an application with eBPF metrics, you need to ensure your Netdata dashboard collects and
displays those metrics independently from any other process.

You can use the `apps_groups.conf` file to configure which applications appear in charts generated by
[`apps.plugin`](/collectors/apps.plugin/README.md). Once you edit this file and create a new group for the application
you want to monitor, you can see how it's interacting with the Linux kernel via real-time eBPF metrics.

Let's assume you have an application that runs as the process `custom-app`. To monitor eBPF metrics for that
application separate from any others, you need to create a new group in `apps_groups.conf` and associate that process
name with it.
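
If you're not sure which name to use, check what the kernel reports for your running application first. A minimal
sketch, with `custom-app` standing in for whatever string actually appears in your process list:

```bash
# List the command names the kernel reports for all running processes,
# then filter for your application; apps_groups.conf matches on these names.
ps -eo comm= | sort -u | grep -i 'custom-app'
```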

Open the `apps_groups.conf` file in your Netdata configuration directory.

```bash
cd /etc/netdata # Replace this path with your Netdata config directory
sudo ./edit-config apps_groups.conf
```

Scroll down past the explanatory comments and stop when you see `# NETDATA processes accounting`. Above that, paste in
the following text, which creates a new `dev` group with the `custom-app` process. Replace `custom-app` with your
application's process name.

Your file should now look like this:

```conf
...
# -----------------------------------------------------------------------------
# Custom applications to monitor with apps.plugin and ebpf.plugin

dev: custom-app

# -----------------------------------------------------------------------------
# NETDATA processes accounting
...
```

Restart Netdata with `sudo service netdata restart` or the appropriate method for your system to begin seeing metrics
for this particular group+process. You can also add additional processes to the same group.
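
To confirm the new group is being collected, you can query the Agent's API for one of the per-application charts and
look for a dimension named after your group. This sketch assumes the Agent listens on `localhost:19999` and that your
group is called `dev`:

```bash
# Pull the latest point of the per-application CPU chart and check
# that the `dev` group shows up as a dimension.
curl -s "http://localhost:19999/api/v1/data?chart=apps.cpu&points=1&format=json" | grep -o '"dev"'
```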

You can set up `apps_groups.conf` to show more precise eBPF metrics for any application or service running on your
system, even if it's a standard package like Redis, Apache, or any other [application/service Netdata collects
from](/collectors/COLLECTORS.md).

```conf
# -----------------------------------------------------------------------------
# Custom applications to monitor with apps.plugin and ebpf.plugin

dev: custom-app
database: *redis*
apache: *apache*

# -----------------------------------------------------------------------------
# NETDATA processes accounting
...
```

Now that you have `apps_groups.conf` set up to monitor your application/service, you can also set up the eBPF collector
to show other charts that will help you debug and troubleshoot how it interacts with the Linux kernel.

## Configure the eBPF collector to monitor errors

The eBPF collector has [two possible modes](/collectors/ebpf.plugin#ebpf-load-mode): `entry` and `return`. The default,
`entry`, only monitors calls to kernel functions, while `return` also monitors and charts _whether these calls return
in error_.

Let's turn on `return` mode for more granularity when debugging your application's behavior.

```bash
cd /etc/netdata # Replace this path with your Netdata config directory
sudo ./edit-config ebpf.conf
```

Replace `entry` with `return`:

```conf
[global]
    ebpf load mode = return
    disable apps = no

[ebpf programs]
    process = yes
    network viewer = yes
```

Restart Netdata with `sudo service netdata restart` or the appropriate method for your system.
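
If you want to double-check that the collector came back up after the restart, look for its process and skim the
Agent's error log. (The log location below is the default on many installs, but yours may differ.)

```bash
# Confirm the eBPF collector process is running.
ps aux | grep '[e]bpf.plugin'

# Skim recent log lines mentioning eBPF for load errors.
grep -i ebpf /var/log/netdata/error.log | tail -n 20
```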

## Get familiar with per-application eBPF metrics and charts

Visit the Netdata dashboard at `http://NODE:19999`, replacing `NODE` with the hostname or IP of the system you're using
to monitor this application. Scroll down to the **Applications** section. These charts now feature a `dev` dimension
with metrics specific to that process.

Pay particular attention to the charts in the **ebpf syscall** and **ebpf net** sections. These charts are populated by
low-level Linux kernel metrics thanks to eBPF, and showcase the volume of calls to open and close files, calls to
functions like `do_fork`, I/O activity on the VFS, and much more.

See the [eBPF collector documentation](/collectors/ebpf.plugin/README.md#integration-with-appsplugin) for the full list
of per-application charts.

Let's show some examples of how you can first identify normal eBPF patterns, then use that knowledge to identify
anomalies in a few simulated scenarios.

For example, the following screenshot shows the number of open files, failures to open files, and closed files on a
Debian 10 system. The first spike comes from configuring and compiling a small C program, the second from running
Apache's `ab` tool to benchmark an Apache web server.

In these charts, you can see first a spike in syscalls to open and close files from the configure/build process,
followed by a similar spike from the Apache benchmark.
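
If you'd like to reproduce a similar pattern on a test system, an `ab` run against a local Apache instance looks
roughly like this. The request count and concurrency here are arbitrary, not the exact values behind the chart above:

```bash
# Send 100,000 requests, 10 at a time, to a local Apache server while you
# watch the apps.file_open and apps.file_closed charts react.
ab -n 100000 -c 10 http://localhost:80/
```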

> 👋 Don't forget that you can view chart data directly via Netdata's API!
>
> For example, open your browser and navigate to `http://NODE:19999/api/v1/data?chart=apps.file_open`, replacing `NODE`
> with the IP address or hostname of your Agent. The API returns JSON of that chart's dimensions and metrics, which you
> can use in other operations.
>
> To see other charts, replace `apps.file_open` with the context of the chart you want to see data for.
>
> To see all the API options, visit our [Swagger
> documentation](https://editor.swagger.io/?url=https://raw.githubusercontent.com/netdata/netdata/master/web/api/netdata-swagger.yaml)
> and look under the **/data** section.
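
As a quick command-line variation on the same idea, here's a sketch that pulls the last minute of the file-open chart
from a local Agent:

```bash
# Fetch the last 60 seconds of the apps.file_open chart as JSON.
# Swap the chart parameter for any other context, e.g. apps.file_closed.
curl -s "http://localhost:19999/api/v1/data?chart=apps.file_open&after=-60&format=json"
```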

## Troubleshoot and debug applications with eBPF

The actual method of troubleshooting and debugging any application with Netdata's eBPF metrics depends on the
application, its place within your stack, and the type of issue you're trying to root cause. This guide won't be able
to explain how to troubleshoot _any_ application with eBPF metrics, but it should give you some ideas on how to start
with your own systems.

The value of using Netdata to collect and visualize eBPF metrics is that you don't have to rely on existing (complex)
command line eBPF programs or, even worse, write your own eBPF program to get the information you need.

Let's walk through some scenarios where you might find value in eBPF metrics.

### Benchmark application performance

You can use eBPF metrics to profile the performance of your applications, whether they're custom or a standard Linux
service, like a web server or database.

For example, look at the charts below. The first spike represents running a Redis benchmark _without_ pipelining
(`redis-benchmark -n 1000000 -t set,get -q`). The second spike represents the same benchmark _with_ pipelining
(`redis-benchmark -n 1000000 -t set,get -q -P 16`).
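
If you'd like to reproduce the experiment, the two runs from the example above look like this, assuming `redis-server`
is running locally and `redis-benchmark` is installed:

```bash
# Run 1: one million SET/GET operations, no pipelining.
redis-benchmark -n 1000000 -t set,get -q

# Run 2: the same workload with 16 commands pipelined per request.
redis-benchmark -n 1000000 -t set,get -q -P 16
```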

The performance optimization is clear from the speed at which the benchmark finished (the horizontal length of the
spike) and the reduced write/read syscalls and bytes written to disk.

You can run similar performance benchmarks against any application, view the results on a Linux kernel level, and
continuously improve the performance of your infrastructure.

### Inspect for leaking file descriptors

If your application runs fine and then only crashes after a few hours, leaking file descriptors may be to blame.

Check the **Number of open files (apps.file_open)** and **Files closed (apps.file_closed)** charts for discrepancies.
These metrics should be more or less equal. If they diverge, with more open files than closed, your application may not
be closing file descriptors properly.
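
Once the charts point at a suspect, you can cross-check from the shell. A minimal sketch, assuming you know the
application's PID (`1234` below is a placeholder):

```bash
# Count the file descriptors the process currently holds open. A count that only
# grows while the charts show opens outpacing closes suggests a descriptor leak.
ls /proc/1234/fd | wc -l
```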

See, for example, the volume of files opened and closed by `apps.plugin` itself. Because the eBPF collector is
monitoring these syscalls at an event level, you can see at any given second that the open and closed numbers are
equal.

This isn't to say Netdata is _perfect_, but at least `apps.plugin` doesn't have a file descriptor problem.

### Pin down syscall failures

If you enabled the eBPF collector's `return` mode as mentioned [in a previous
step](#configure-the-ebpf-collector-to-monitor-errors), you can view charts related to how often a given application's
syscalls return in failure.

By understanding when these failures happen, and how often, you might be able to diagnose a bug in your application.

To diagnose potential issues with an application, look at the **Fails to open files (apps.file_open_error)**, **Fails
to close files (apps.file_close_error)**, **Fails to write (apps.vfs_write_error)**, and **Fails to read
(apps.vfs_read_error)** charts for failed syscalls coming from your application. If you see any, look to the
surrounding charts for anomalies at the same time frame, or correlate with other activity in the application or on the
system to get closer to the root cause.
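
When the error charts light up, a targeted trace of the same syscalls can show the exact error codes behind the
failures. A sketch, assuming `strace` is installed and `1234` is your application's PID:

```bash
# Attach to the running process and log only file-related syscalls, so you can
# see which paths fail and with what errno (for example EMFILE or ENOENT).
sudo strace -f -e trace=openat,close,read,write -p 1234
```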

### Investigate zombie processes

Look for the trio of **Process started (apps.process_create)**, **Threads started (apps.thread_create)**, and **Tasks
closed (apps.task_close)** charts to investigate situations where an application inadvertently leaves [zombie
processes](https://en.wikipedia.org/wiki/Zombie_process).

These processes have terminated but haven't been reaped by their parent, so while they no longer consume CPU or memory,
they each still hold a PID and can cause issues if your system runs out of available PIDs to allocate.
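
To see whether zombies are piling up right now, and how much PID headroom the system has, a quick shell check works
well alongside the charts:

```bash
# List processes currently in the zombie (Z) state along with their parents,
# then show the maximum number of PIDs the kernel will hand out.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
cat /proc/sys/kernel/pid_max
```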

For example, the chart below demonstrates a [zombie factory
program](https://www.refining-linux.org/archives/7-Dr.-Frankenlinux-or-how-to-create-zombie-processes.html) in action.

Starting at 14:51:49, Netdata sees the `zombie` group creating one new process every second, but no closed tasks. This
continues for roughly 30 seconds, at which point the factory program was killed with `SIGINT`, which results in the 31
closed tasks in the subsequent second.

Zombie processes may not be catastrophic, but if you're developing an application on Linux, you should eliminate them.
If a service in your stack creates them, you should consider filing a bug report.

## View eBPF metrics in Netdata Cloud

You can also show per-application eBPF metrics in Netdata Cloud. This could be particularly useful if you're running
the same application on multiple systems and want to correlate how it performs on each target, or if you want to share
your findings with someone else on your team.

If you don't already have a Netdata Cloud account, go [sign up](https://app.netdata.cloud) and get started for free.
Read the [get started with Cloud guide](https://learn.netdata.cloud/docs/cloud/get-started) for a walkthrough of node
claiming and other fundamentals.

Add more charts to the Nodes view by clicking on the gear icon at the far end of the table. Click on the **Context**
input and scroll until you find the eBPF chart you're interested in, or type in the name of the context directly. Maybe
something like `apps.vfs_write_call`? Next, click on the **Dimensions** input and find the `dev` dimension, or maybe
`apache`/`redis` if you set your `apps_groups.conf` up like the examples above.

Click **Save** to add the chart for all your claimed nodes. Now you can see how your application interacts with
multiple Linux kernels on multiple Linux systems.

Now that you can see these metrics in Netdata Cloud, you can [invite your
team](https://learn.netdata.cloud/docs/cloud/collaborate/invite-your-team) and share your findings with others.

## What's next?

Debugging and troubleshooting an application takes a special combination of practice, experience, and sheer luck. With
Netdata's eBPF metrics to back you up, you can rest assured that you see every minute detail of how your application
interacts with the Linux kernel.

If you're still trying to wrap your head around what we offer, be sure to read up on our accompanying documentation and
other resources on eBPF monitoring with Netdata:

- [eBPF collector](/collectors/ebpf.plugin/README.md)
- [eBPF's integration with `apps.plugin`](/collectors/apps.plugin/README.md#integration-with-ebpf)
- [Linux eBPF monitoring with Netdata](https://www.netdata.cloud/blog/linux-ebpf-monitoring-with-netdata/)

The scenarios described above are just the beginning when it comes to troubleshooting with eBPF metrics. We're excited
to explore others and see what our community dreams up. If you have other use cases, whether simulated or real-world,
we'd love to hear them: [info@netdata.cloud](mailto:info@netdata.cloud).

Happy troubleshooting!