
Update docs on metric storage (#13327)

This PR:

- Explains the new tiering mechanism.
- Tidies up the docs about the Agent's database options.
- Updates all the configuration options for the `dbengine`.
- Provides a new way for users to calculate the space they need for metric storage
  (via a spreadsheet)

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
Co-authored-by: DShreve2 <david@netdata.cloud>
Tasos Katsoulas, 2 years ago
Parent
Commit
bc5ba4f891

+ 27 - 15
daemon/config/README.md

@@ -82,21 +82,33 @@ Please note that your data history will be lost if you have modified `history` p
 
 ### [db] section options
 
-|              setting               |  default   | info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
-|:----------------------------------:|:----------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-|                mode                | `dbengine` | `dbengine`: The default for long-term metrics storage with efficient RAM and disk usage. Can be extended with `dbengine page cache size MB` and `dbengine disk space MB`. <br />`save`: Netdata will save its round robin database on exit and load it on startup. <br />`map`: Cache files will be updated in real-time. Not ideal for systems with high load or slow disks (check `man mmap`). <br />`ram`: The round-robin database will be temporary and it will be lost when Netdata exits. <br />`none`: Disables the database at this host, and disables health monitoring entirely, as that requires a database of metrics. |
-|             retention              |   `3600`   | Used with `mode = save/map/ram/alloc`, not the default `mode = dbengine`. This number reflects the number of entries the `netdata` daemon will by default keep in memory for each chart dimension. Check [Memory Requirements](/database/README.md) for more information.                                                                                                                                                                                                                                                                                                                                                           |
-|            update every            |    `1`     | The frequency in seconds, for data collection. For more information see the [performance guide](/docs/guides/configure/performance.md).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
-|    dbengine page cache size MB     |     32     | Determines the amount of RAM in MiB that is dedicated to caching Netdata metric values.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
-|       dbengine disk space MB       |    256     | Determines the amount of disk space in MiB that is dedicated to storing Netdata metric values and all related metadata describing them.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
-|  dbengine multihost disk space MB  |    256     | Same functionality as `dbengine disk space MB`, but includes support for storing metrics streamed to a parent node by its children. Can be used in single-node environments as well.                                                                                                                                                                                                                                                                                                                                                                                                                                                |
-|     memory deduplication (ksm)     |   `yes`    | When set to `yes`, Netdata will offer its in-memory round robin database and the dbengine page cache to kernel same page merging (KSM) for deduplication. For more information check [Memory Deduplication - Kernel Same Page Merging - KSM](/database/README.md#ksm)                                                                                                                                                                                                                                                                                                                                                               |
-| cleanup obsolete charts after secs |   `3600`   | See [monitoring ephemeral containers](/collectors/cgroups.plugin/README.md#monitoring-ephemeral-containers), also sets the timeout for cleaning up obsolete dimensions                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
-|   gap when lost iterations above   |    `1`     |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
-|  cleanup orphan hosts after secs   |   `3600`   | How long to wait until automatically removing from the DB a remote Netdata host (child) that is no longer sending data.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
-|    delete obsolete charts files    |   `yes`    | See [monitoring ephemeral containers](/collectors/cgroups.plugin/README.md#monitoring-ephemeral-containers), also affects the deletion of files for obsolete dimensions                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
-|     delete orphan hosts files      |   `yes`    | Set to `no` to disable non-responsive host removal.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
-|        enable zero metrics         |    `no`    | Set to `yes` to show charts when all their metrics are zero.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+|                    setting                    |  default   | info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+|:---------------------------------------------:|:----------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+|                     mode                      | `dbengine` | `dbengine`: The default for long-term metrics storage with efficient RAM and disk usage. Can be extended with `dbengine page cache size MB` and `dbengine disk space MB`. <br />`save`: Netdata will save its round robin database on exit and load it on startup. <br />`map`: Cache files will be updated in real-time. Not ideal for systems with high load or slow disks (check `man mmap`). <br />`ram`: The round-robin database will be temporary and it will be lost when Netdata exits. <br />`none`: Disables the database at this host, and disables health monitoring entirely, as that requires a database of metrics. |
+|                   retention                   |   `3600`   | Used with `mode = save/map/ram/alloc`, not the default `mode = dbengine`. This number reflects the number of entries the `netdata` daemon will by default keep in memory for each chart dimension. Check [Memory Requirements](/database/README.md) for more information.                                                                                                                                                                                                                                                                                                                                                           |
+|                 storage tiers                 |    `1`     | The number of storage tiers you want to have in your dbengine. Check the tiering mechanism in the [dbengine's reference](/database/engine/README.md#tiering). You can have up to 5 tiers of data (including _Tier 0_), so this setting ranges from 1 to 5. |
+|          dbengine page cache size MB          |    `32`    | Determines the amount of RAM in MiB that is dedicated to caching _Tier 0_ Netdata metric values. |
+|   dbengine tier **`N`** page cache size MB    |    `32`    | Determines the amount of RAM in MiB that is dedicated to caching Netdata metric values of the **`N`** tier. <br /> `N belongs to [1..4]` |
+|            dbengine disk space MB             |   `256`    | Determines the amount of disk space in MiB that is dedicated to storing _Tier 0_ Netdata metric values and all related metadata describing them. This option is available **only for legacy configuration** (`Agent v1.23.2 and prior`). |
+|       dbengine multihost disk space MB        |   `256`    | Same functionality as `dbengine disk space MB`, but includes support for storing metrics streamed to a parent node by its children. Can be used in single-node environments as well. This setting is only for _Tier 0_ metrics.                                                                                                                                                                                                                                                                                                                                                                                                     |
+| dbengine tier **`N`** multihost disk space MB |   `256`    | Same functionality as `dbengine multihost disk space MB`, but stores metrics of the **`N`** tier (both parent node and its children). Can be used in single-node environments as well. <br /> `N belongs to [1..4]`                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|                 update every                  |    `1`     | The frequency, in seconds, of data collection. For more information see the [performance guide](/docs/guides/configure/performance.md). These metrics are stored as _Tier 0_ data. Explore the tiering mechanism in the [dbengine's reference](/database/engine/README.md#tiering). |
+| dbengine tier **`N`** update every iterations |    `60`    | The downsampling ratio of each tier relative to the tier directly below it. By default, each tier stores one data point for every 60 points it collects from the previous tier. This setting can take values from `2` up to `255`. <br /> `N belongs to [1..4]` |
+|        dbengine tier **`N`** back fill        |   `New`    | Specifies the strategy for recreating missing data on each Tier from the Tier directly below it. <br /> `New`: Checks the latest point on each Tier and saves new points to it only if the Tier below has available points for its observation window (the `dbengine tier N update every iterations` window). <br /> `none`: No backfilling is applied. <br /> `N belongs to [1..4]` |
+|          memory deduplication (ksm)           |   `yes`    | When set to `yes`, Netdata will offer its in-memory round robin database and the dbengine page cache to kernel same page merging (KSM) for deduplication. For more information check [Memory Deduplication - Kernel Same Page Merging - KSM](/database/README.md#ksm)                                                                                                                                                                                                                                                                                                                                                               |
+|      cleanup obsolete charts after secs       |   `3600`   | See [monitoring ephemeral containers](/collectors/cgroups.plugin/README.md#monitoring-ephemeral-containers), also sets the timeout for cleaning up obsolete dimensions                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+|        gap when lost iterations above         |    `1`     |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+|        cleanup orphan hosts after secs        |   `3600`   | How long to wait until automatically removing from the DB a remote Netdata host (child) that is no longer sending data.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+|         delete obsolete charts files          |   `yes`    | See [monitoring ephemeral containers](/collectors/cgroups.plugin/README.md#monitoring-ephemeral-containers), also affects the deletion of files for obsolete dimensions                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+|           delete orphan hosts files           |   `yes`    | Set to `no` to disable non-responsive host removal.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|              enable zero metrics              |    `no`    | Set to `yes` to show charts when all their metrics are zero.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+
+:::info
+
+The product of the `dbengine tier N update every iterations` values of all **enabled** tiers must be less than `65535`.
+
+:::
+
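+As a minimal sketch (the option names follow the table above; the values are only an example), a three-tier
+configuration could look like the following. With the default downsampling of 60 per tier, the product of the enabled
+tiers' `update every iterations` values is `60 x 60 = 3600`, well below the `65535` limit:
+
+```conf
+[db]
+    mode = dbengine
+    storage tiers = 3
+
+    # Tier 0: per-second data
+    update every = 1
+    dbengine multihost disk space MB = 256
+
+    # Tier 1: one point for every 60 Tier 0 points (per minute)
+    dbengine tier 1 update every iterations = 60
+    dbengine tier 1 multihost disk space MB = 256
+
+    # Tier 2: one point for every 60 Tier 1 points (per hour)
+    dbengine tier 2 update every iterations = 60
+    dbengine tier 2 multihost disk space MB = 256
+```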
 
 ### [directories] section options
 

+ 76 - 137
database/README.md

@@ -7,199 +7,138 @@ custom_edit_url: https://github.com/netdata/netdata/edit/master/database/README.
 # Database
 
 Netdata is fully capable of long-term metrics storage, at per-second granularity, via its default database engine
-(`dbengine`). But to remain as flexible as possible, Netdata supports a number of types of metrics storage:
+(`dbengine`). But to remain as flexible as possible, Netdata supports several storage options:
 
 1. `dbengine`, (the default) data are in database files. The [Database Engine](/database/engine/README.md) works like a
-    traditional database. There is some amount of RAM dedicated to data caching and indexing and the rest of the data
-    reside compressed on disk. The number of history entries is not fixed in this case, but depends on the configured
-    disk space and the effective compression ratio of the data stored. This is the **only mode** that supports changing
-    the data collection update frequency (`update every`) **without losing** the previously stored metrics. For more
-    details see [here](/database/engine/README.md).
+   traditional database. There is some amount of RAM dedicated to data caching and indexing and the rest of the data
+   reside compressed on disk. The number of history entries is not fixed in this case, but depends on the configured
+   disk space and the effective compression ratio of the data stored. This is the **only mode** that supports changing
+   the data collection update frequency (`update every`) **without losing** the previously stored metrics. For more
+   details see [here](/database/engine/README.md).
 
-2.  `ram`, data are purely in memory. Data are never saved on disk. This mode uses `mmap()` and supports [KSM](#ksm).
+2. `ram`, data are purely in memory. Data are never saved on disk. This mode uses `mmap()` and supports [KSM](#ksm).
 
-3.  `save`, data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata
-    restart. It also uses `mmap()` and supports [KSM](#ksm).
+3. `save`, data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata restart. It also
+   uses `mmap()` and supports [KSM](#ksm).
 
-4.  `map`, data are in memory mapped files. This works like the swap. Keep in mind though, this will have a constant
-    write on your disk. When Netdata writes data on its memory, the Linux kernel marks the related memory pages as dirty
-    and automatically starts updating them on disk. Unfortunately we cannot control how frequently this works. The Linux
-    kernel uses exactly the same algorithm it uses for its swap memory. Check below for additional information on
-    running a dedicated central Netdata server. This mode uses `mmap()` but does not support [KSM](#ksm).
+4. `map`, data are in memory mapped files. This works like swap. When Netdata writes data to its memory, the Linux
+   kernel marks the related memory pages as dirty and automatically starts updating them on disk. Unfortunately, we
+   cannot control how frequently this happens. The Linux kernel uses exactly the same algorithm it uses for its swap
+   memory. This mode uses `mmap()` but does not support [KSM](#ksm). _Keep in mind though, this option will constantly
+   write to your disk._
 
-5.  `none`, without a database (collected metrics can only be streamed to another Netdata).
+5. `alloc`, like `ram` but it uses `calloc()` and does not support [KSM](#ksm). This mode is the fallback for all others
+   except `none`.
 
-6.  `alloc`, like `ram` but it uses `calloc()` and does not support [KSM](#ksm). This mode is the fallback for all
-    others except `none`.
+6. `none`, without a database (collected metrics can only be streamed to another Netdata).
 
-You can select the database mode by editing `netdata.conf` and setting:
-
-```conf
-[db]
-  # dbengine (default), ram, save (the default if dbengine not available), map (swap like), none, alloc
-  mode = dbengine
-```
-
-## Running Netdata in embedded devices
-
-Embedded devices usually have very limited RAM resources available.
-
-There are 2 settings for you to tweak:
-
-1.  `[db].update every`, which controls the data collection frequency
-2.  `[db].retention`, which controls the size of the database in memory (except for `[db].mode = dbengine`)
-
-By default `[db].update every = 1` and `[db].retention = 3600`. This gives you an hour of data with per second updates.
-
-If you set `[db].update every = 2` and `[db].retention = 1800`, you will still have an hour of data, but collected once every 2
-seconds. This will **cut in half** both CPU and RAM resources consumed by Netdata. Of course experiment a bit. On very
-weak devices you might have to use `[db].update every = 5` and `[db].retention = 720` (still 1 hour of data, but 1/5 of the CPU and
-RAM resources).
-
-You can also disable [data collection plugins](/collectors/README.md) you don't need. Disabling such plugins will also free both
-CPU and RAM resources.
-
-## Running a dedicated parent Netdata server
-
-Netdata allows streaming data between Netdata nodes in real-time. This allows having one or more parent Netdata servers that will maintain
-the entire database for all the nodes that connect to them (their children), and will also run health checks/alarms for all these nodes.
-
-### map
+## Which database mode to use
 
-In this mode, the database of Netdata is stored in memory mapped files. Netdata continues to read and write the database
-in memory, but the kernel automatically loads and saves memory pages from/to disk.
+The default mode `[db].mode = dbengine` has been designed to scale for longer retentions and is the only mode suitable
+for parent Agents in _Parent - Child_ setups.
 
-**We suggest _not_ to use this mode on nodes that run other applications.** There will always be dirty memory to be
-synced and this syncing process may influence the way other applications work. This mode however is useful when we need
-a parent Netdata server that would normally need huge amounts of memory.
+The other available database modes are designed to minimize resource utilization and should only be considered on the
+children side of [Parent - Child](/docs/metrics-storage-management/how-streaming-works) setups, and only when the
+resource constraints are very strict.
 
-There are a few kernel options that provide finer control on the way this syncing works. But before explaining them, a
-brief introduction of how Netdata database works is needed.
+So,
 
-For each chart, Netdata maps the following files:
+- On a single node setup, use `[db].mode = dbengine`.
+- On a [Parent - Child](/docs/metrics-storage-management/how-streaming-works) setup, use `[db].mode = dbengine` on the
+  parent to increase retention, and a more resource-efficient mode on the children to minimize resource utilization,
+  such as `dbengine` with light retention settings, or the `save`, `ram` or `none` modes (see the sketch below).
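+
+A minimal sketch of a child's `netdata.conf` for such a setup, assuming you choose the `ram` mode (the retention value
+is only an illustration; it counts entries per dimension, so `900` entries at one-second collection cover 15 minutes):
+
+```conf
+[db]
+    # a resource-light mode for a child that streams its metrics to a parent
+    mode = ram
+    retention = 900
+```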
 
-1.  `chart/main.db`, this is the file that maintains chart information. Every time data are collected for a chart, this
-    is updated.
-2.  `chart/dimension_name.db`, this is the file for each dimension. At its beginning there is a header, followed by the
-    round robin database where metrics are stored.
+## Choose your database mode
 
-So, every time Netdata collects data, the following pages will become dirty:
-
-1.  the chart file
-2.  the header part of all dimension files
-3.  if the collected metrics are stored far enough in the dimension file, another page will become dirty, for each
-    dimension
-
-Each page in Linux is 4KB. So, with 200 charts and 1000 dimensions, there will be 1200 to 2200 4KB pages dirty pages
-every second. Of course 1200 of them will always be dirty (the chart header and the dimensions headers) and 1000 will be
-dirty for about 1000 seconds (4 bytes per metric, 4KB per page, so 1000 seconds, or 16 minutes per page).
-
-Hopefully, the Linux kernel does not sync all these data every second. The frequency they are synced is controlled by
-`/proc/sys/vm/dirty_expire_centisecs` or the `sysctl` `vm.dirty_expire_centisecs`. The default on most systems is 3000
-(30 seconds).
-
-On a busy server centralizing metrics from 20+ servers you will experience this:
-
-![image](https://cloud.githubusercontent.com/assets/2662304/23834750/429ab0dc-0764-11e7-821a-d7908bc881ac.png)
-
-As you can see, there is quite some stress (this is `iowait`) every 30 seconds.
-
-A simple solution is to increase this time to 10 minutes (60000). This is the same system with this setting in 10
-minutes:
-
-![image](https://cloud.githubusercontent.com/assets/2662304/23834784/d2304f72-0764-11e7-8389-fb830ffd973a.png)
-
-Of course, setting this to 10 minutes means that data on disk might be up to 10 minutes old if you get an abnormal
-shutdown.
+You can select the database mode by editing `netdata.conf` and setting:
 
-There are 2 more options to tweak:
+```conf
+[db]
+  # dbengine (default), ram, save (the default if dbengine not available), map (swap like), none, alloc
+  mode = dbengine
+```
 
-1.  `dirty_background_ratio`, by default `10`.
-2.  `dirty_ratio`, by default `20`.
+## Netdata Longer Metrics Retention
 
-These control the amount of memory that should be dirty for disk syncing to be triggered. On dedicated Netdata servers,
-you can use: `80` and `90` respectively, so that all RAM is given to Netdata.
+Metrics retention is controlled only by the disk space allocated to storing metrics. Longer retention, however, also
+increases the memory and CPU the Agent needs to query longer timeframes.
 
-With these settings, you can expect a little `iowait` spike once every 10 minutes and in case of system crash, data on
-disk will be up to 10 minutes old.
+Since Netdata Agents usually run at the edge, on production systems, consider using Netdata Agent **parents**.
+When having a [**parent - child**](/docs/metrics-storage-management/how-streaming-works.md) setup, the child (the
+Netdata Agent running on a production system) delegates all of its functions, including longer metrics retention and
+querying, to the parent node that can dedicate more resources to this task. A single Netdata Agent parent can centralize
+multiple child Netdata Agents (dozens, hundreds, or even thousands, depending on its available resources).
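+
+A hedged sketch of the relevant `[db]` settings on such a parent (the `4096` figure is only an illustration; size it
+based on the retention calculations in the [dbengine reference](/database/engine/README.md#tiering)):
+
+```conf
+[db]
+    mode = dbengine
+    # disk space for the Tier 0 metrics of the parent and all of its children
+    dbengine multihost disk space MB = 4096
+```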
 
-![image](https://cloud.githubusercontent.com/assets/2662304/23835030/ba4bf506-0768-11e7-9bc6-3b23e080c69f.png)
+## Running Netdata on embedded devices
 
-To have these settings automatically applied on boot, create the file `/etc/sysctl.d/netdata-memory.conf` with these
-contents:
+Embedded devices typically have very limited RAM resources available.
 
-```conf
-vm.dirty_expire_centisecs = 60000
-vm.dirty_background_ratio = 80
-vm.dirty_ratio = 90
-vm.dirty_writeback_centisecs = 0
-```
+There are two settings for you to configure:
 
-There is another mode to help overcome the memory size problem. What is **most interesting for this setup** is
-`[db].mode = dbengine`.
+1. `[db].update every`, which controls the data collection frequency
+2. `[db].retention`, which controls the size of the database in memory (except for `[db].mode = dbengine`)
 
-### dbengine
+By default `[db].update every = 1` and `[db].retention = 3600`. This gives you an hour of data with per second updates.
 
-In this mode, the database of Netdata is stored in database files. The [Database Engine](/database/engine/README.md)
-works like a traditional database. There is some amount of RAM dedicated to data caching and indexing and the rest of
-the data reside compressed on disk. The number of history entries is not fixed in this case, but depends on the
-configured disk space and the effective compression ratio of the data stored.
+If you set `[db].update every = 2` and `[db].retention = 1800`, you will still have an hour of data, but collected once
+every 2 seconds. This will **cut in half** both CPU and RAM resources consumed by Netdata. Of course, experiment a bit to find the right setting.
+On very weak devices you might have to use `[db].update every = 5` and `[db].retention = 720` (still 1 hour of data, but
+1/5 of the CPU and RAM resources).
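+
+For example, a minimal sketch of the corresponding `netdata.conf` snippet (using the values discussed above):
+
+```conf
+[db]
+    # collect once every 2 seconds; 1800 entries still cover one hour
+    update every = 2
+    retention = 1800
+```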
 
-We suggest to use **this** mode on nodes that also run other applications. The Database Engine uses direct I/O to avoid
-polluting the OS filesystem caches and does not generate excessive I/O traffic so as to create the minimum possible
-interference with other applications. Using mode `dbengine` we can overcome most memory restrictions. For more
-details see [here](/database/engine/README.md).
+You can also disable [data collection plugins](/collectors/README.md) that you don't need. Disabling such plugins will also
+free both CPU and RAM resources.
 
-## KSM
+## Memory optimizations
 
-Netdata offers all its in-memory database to kernel for deduplication.
+### KSM
 
-In the past KSM has been criticized for consuming a lot of CPU resources. Although this is true when KSM is used for
-deduplicating certain applications, it is not true with netdata, since the Netdata memory is written very infrequently
-(if you have 24 hours of metrics in netdata, each byte at the in-memory database will be updated just once per day).
+KSM performs memory deduplication by scanning through main memory for physical pages that have identical content, and
+identifies the virtual pages that are mapped to those physical pages. It leaves one page unchanged, and re-maps each
+duplicate page to point to the same physical page. Netdata offers all of its in-memory database to the kernel for
+deduplication.
 
-KSM is a solution that will provide 60+% memory savings to Netdata.
+In the past, KSM has been criticized for consuming a lot of CPU resources. This is true when KSM is used for
+deduplicating certain applications, but it is not true for Netdata. The Agent's memory is written very infrequently
+(if you have 24 hours of metrics in Netdata, each byte in the in-memory database will be updated just once per day). KSM
+is a solution that will provide 60+% memory savings to Netdata.
 
 ### Enable KSM in kernel
 
-You need to run a kernel compiled with:
+To enable KSM in the kernel, you need to run a kernel compiled with the following:
 
 ```sh
 CONFIG_KSM=y
 ```
 
-When KSM is enabled at the kernel is just available for the user to enable it.
+When KSM is enabled in the kernel, it is merely made available; the user still has to turn it on.
 
-So, if you build a kernel with `CONFIG_KSM=y` you will just get a few files in `/sys/kernel/mm/ksm`. Nothing else
-happens. There is no performance penalty (apart I guess from the memory this code occupies into the kernel).
+If you build a kernel with `CONFIG_KSM=y`, you will just get a few files in `/sys/kernel/mm/ksm`. Nothing else
+happens. There is no performance penalty (apart from the memory this code occupies in the kernel).
 
 The files that `CONFIG_KSM=y` offers include:
 
--   `/sys/kernel/mm/ksm/run` by default `0`. You have to set this to `1` for the
-    kernel to spawn `ksmd`.
--   `/sys/kernel/mm/ksm/sleep_millisecs`, by default `20`. The frequency ksmd
-    should evaluate memory for deduplication.
--   `/sys/kernel/mm/ksm/pages_to_scan`, by default `100`. The amount of pages
-    ksmd will evaluate on each run.
+- `/sys/kernel/mm/ksm/run` by default `0`. You have to set this to `1` for the kernel to spawn `ksmd`.
+- `/sys/kernel/mm/ksm/sleep_millisecs`, by default `20`. How often `ksmd` should evaluate memory for deduplication.
+- `/sys/kernel/mm/ksm/pages_to_scan`, by default `100`. The number of pages `ksmd` will evaluate on each run.
 
 So, by default `ksmd` is just disabled. It will not harm performance and the user/admin can control the CPU resources
-he/she is willing `ksmd` to use.
+they are willing to let `ksmd` use.
 
 ### Run `ksmd` kernel daemon
 
-To activate / run `ksmd` you need to run:
+To activate / run `ksmd`, you need to run the following:
 
 ```sh
 echo 1 >/sys/kernel/mm/ksm/run
 echo 1000 >/sys/kernel/mm/ksm/sleep_millisecs
 ```
 
-With these settings ksmd does not even appear in the running process list (it will run once per second and evaluate 100
+With these settings, ksmd does not even appear in the running process list (it will run once per second and evaluate 100
 pages for de-duplication).
 
 Put the above lines in your boot sequence (`/etc/rc.local` or equivalent) to have `ksmd` run at boot.
 
-## Monitoring Kernel Memory de-duplication performance
+### Monitoring Kernel Memory de-duplication performance
 
 Netdata will create charts for kernel memory de-duplication performance, like this:
 

+ 125 - 81
database/engine/README.md

@@ -6,74 +6,114 @@ custom_edit_url: https://github.com/netdata/netdata/edit/master/database/engine/
 
 # Database engine
 
-The Database Engine works like a traditional database. It dedicates a certain amount of RAM to data caching and
-indexing, while the rest of the data resides compressed on disk. Unlike other [database modes](/database/README.md), the
-amount of historical metrics stored is based on the amount of disk space you allocate and the effective compression
+The Database Engine works like a traditional time series database. Unlike other [database modes](/database/README.md),
+the amount of historical metrics stored is based on the amount of disk space you allocate and the effective compression
 ratio, not a fixed number of metrics collected.
 
-By using both RAM and disk space, the database engine allows for long-term storage of per-second metrics inside of the
-Agent itself.
+## Tiering
 
-In addition, the dbengine is the only mode that supports changing the data collection update frequency
-(`update every`) without losing the metrics your Agent already gathered and stored.
+Tiering is a mechanism for providing multiple tiers of data with
+different [granularity on metrics](/docs/store/distributed-data-architecture.md#granularity-of-metrics).
 
-## Configuration
+For Netdata Agents with version `netdata-1.35.0.138.nightly` and greater, `dbengine` supports Tiering, allowing almost
+unlimited retention of data.
 
-To use the database engine, open `netdata.conf` and set `[db].mode` to `dbengine`.
 
-```conf
+### Metric size
+
+Every Tier downsamples the tier directly below it (lower tiers have greater resolution). You can have up to 5
+Tiers **[0..4]** of data, including Tier 0, which has the highest resolution.
+
+Tier 0 is the default that was always available in `dbengine` mode. Tier 1 is the first level of aggregation, Tier 2 is
+the second, and so on.
+
+Metrics on all tiers except _Tier 0_ also store the following five additional values for every point, for accurate
+representation:
+
+1. The `sum` of the points aggregated
+2. The `min` of the points aggregated
+3. The `max` of the points aggregated
+4. The `count` of the points aggregated (could be constant, but it may not be due to gaps in data collection)
+5. The `anomaly_count` of the points aggregated (how many of the aggregated points were found anomalous)
+
+Among `min`, `max` and `sum`, the correct value is chosen based on the user query. `average` is calculated on the fly at
+query time.
+
+### Tiering in a nutshell
+
+The `dbengine` is capable of retaining metrics for years. To further understand the `dbengine` tiering mechanism, let's
+explore the following configuration.
+
+```conf
 [db]
     mode = dbengine
+    
+    # per second data collection
+    update every = 1
+    
+    # enables Tier 1 and Tier 2, Tier 0 is always enabled in dbengine mode
+    storage tiers = 3
+    
+    # Tier 0, per second data for a week
+    dbengine multihost disk space MB = 1100
+    
+    # Tier 1, per minute data for a month
+    dbengine tier 1 multihost disk space MB = 330
+
+    # Tier 2, per hour data for a year
+    dbengine tier 2 multihost disk space MB = 67
 ```
 
-To configure the database engine, look for the `dbengine page cache size MB` and `dbengine multihost disk space MB` settings in the
-`[db]` section of your `netdata.conf`. The Agent ignores the `[db].retention` setting when using the dbengine.
+For 2000 metrics, collected every second and retained for a week, Tier 0 needs: 1 byte x 2000 metrics x 3600 secs per
+hour x 24 hours per day x 7 days per week = 1100MB.
 
-```conf
-[db]
-    dbengine page cache size MB = 32
-    dbengine multihost disk space MB = 256
-```
+By setting `dbengine multihost disk space MB` to `1100`, this node will start maintaining about a week of data. But pay
+attention to the number of metrics. If you have more than 2000 metrics on a node, or you need more than a week of high
+resolution metrics, you may need to adjust this setting accordingly.
+
+Tier 1 by default samples the data every **60 points of Tier 0**. In our case, Tier 0 is per second, so in terms of
+time the Tier 1 "resolution" is per minute.
+
+Tier 1 needs four times more storage per point compared to Tier 0. So, for 2000 metrics, with per minute resolution,
+retained for a month, Tier 1 needs: 4 bytes x 2000 metrics x 60 minutes per hour x 24 hours per day x 30 days per month
+= 330MB.
+
+Tier 2 by default samples data every 3600 points of Tier 0 (60 points of Tier 1, the tier directly below it). Again, in
+terms of time (Tier 0 is per second), Tier 2 is per hour.
+
+The storage requirements per point are the same as for Tier 1.
 
-The above values are the default values for Page Cache size and DB engine disk space quota.
+For 2000 metrics, with per hour resolution, retained for a year, Tier 2 needs: 4 bytes x 2000 metrics x 24 hours per day
+x 365 days per year = 67MB.
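+
+As a rough rule of thumb derived from the worked examples above (an approximation, not an exact sizing formula): Tier 0
+needs about 1 byte per stored point, higher tiers need about 4 bytes per stored point, and each tier keeps one point
+for every `dbengine tier N update every iterations` points of the tier below it. So, per tier:
+
+```
+disk space ≈ bytes per point x number of metrics x points retained per metric
+```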
 
-The `dbengine page cache size MB` option determines the amount of RAM dedicated to caching Netdata metric values. The
-actual page cache size will be slightly larger than this figure—see the [memory requirements](#memory-requirements)
-section for details.
+## Legacy configuration
 
-The `dbengine multihost disk space MB` option determines the amount of disk space that is dedicated to storing
-Netdata metric values and all related metadata describing them. You can use the [**database engine
-calculator**](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics)
-to correctly set `dbengine multihost disk space MB` based on your metrics retention policy. The calculator gives an
-accurate estimate based on how many child nodes you have, how many metrics your Agent collects, and more.
+### v1.35.1 and prior
 
-### Legacy configuration
+These versions of the Agent do not support [Tiering](#tiering). You could change the metric retention for the parent and
+all of its children only with the `dbengine multihost disk space MB` setting. This setting accounts for the space
+allocated to the parent node and all of its children.
 
-The deprecated `dbengine disk space MB` option determines the amount of disk space that is dedicated to storing
-Netdata metric values per legacy database engine instance (see [details on the legacy mode](#legacy-mode) below).
+To configure the database engine, look for the `dbengine page cache size MB` and `dbengine multihost disk space MB` settings in
+the `[db]` section of your `netdata.conf`.
 
 ```conf
 [db]
-    dbengine disk space MB = 256
+    dbengine page cache size MB = 32
+    dbengine multihost disk space MB = 256
 ```
 
-### Streaming metrics to the database engine
-
-When using the multihost database engine, all parent and child nodes share the same `dbengine page cache size MB` and `dbengine
-multihost disk space MB` in a single dbengine instance. The [**database engine
-calculator**](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics)
-helps you properly set `dbengine page cache size MB` and `dbengine multihost disk space MB` on your parent node to allocate enough
-resources based on your metrics retention policy and how many child nodes you have.
-
-#### Legacy mode
+### v1.23.2 and prior
 
 _For Netdata Agents earlier than v1.23.2_, the Agent on the parent node uses one dbengine instance for itself, and
 another instance for every child node it receives metrics from. If you had four streaming nodes, you would have five
 instances in total (`1 parent + 4 child nodes = 5 instances`).
 
-The Agent allocates resources for each instance separately using the `dbengine disk space MB` (**deprecated**) setting. If
-`dbengine disk space MB`(**deprecated**) is set to the default `256`, each instance is given 256 MiB in disk space, which
-means the total disk space required to store all instances is, roughly, `256 MiB * 1 parent * 4 child nodes = 1280 MiB`.
+The Agent allocates resources for each instance separately using the `dbengine disk space MB` (**deprecated**) setting.
+If `dbengine disk space MB` (**deprecated**) is set to the default `256`, each instance is given 256 MiB of disk space,
+which means the total disk space required to store all instances is,
+roughly, `256 MiB * (1 parent + 4 child nodes) = 1280 MiB`.
 
 #### Backward compatibility
 
@@ -90,41 +130,44 @@ Agent.
 For more information about setting `[db].mode` on your nodes, in addition to other streaming configurations, see
 [streaming](/streaming/README.md).
 
-### Memory requirements
+## Requirements & limitations
+
+### Memory
 
 Using database mode `dbengine` we can overcome most memory restrictions and store a dataset that is much larger than the
 available memory.
 
 There are explicit memory requirements **per** DB engine **instance**:
 
--   The total page cache memory footprint will be an additional `#dimensions-being-collected x 4096 x 2` bytes over what
-    the user configured with `dbengine page cache size MB`.
+- The total page cache memory footprint will be an additional `#dimensions-being-collected x 4096 x 2` bytes over what
+  the user configured with `dbengine page cache size MB`.
+
 
--   an additional `#pages-on-disk x 4096 x 0.03` bytes of RAM are allocated for metadata.
+- an additional `#pages-on-disk x 4096 x 0.03` bytes of RAM are allocated for metadata.
 
-    -   roughly speaking this is 3% of the uncompressed disk space taken by the DB files.
+    - roughly speaking this is 3% of the uncompressed disk space taken by the DB files.
 
-    -   for very highly compressible data (compression ratio > 90%) this RAM overhead is comparable to the disk space
-        footprint.
+    - for very highly compressible data (compression ratio > 90%) this RAM overhead is comparable to the disk space
+      footprint.
 
 An important observation is that RAM usage depends on both the `page cache size` and the `dbengine multihost disk space`
 options.
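+
+For a rough sense of scale, using the 2000-metrics example from the tiering section and the default 32 MiB page cache
+(an approximation only):
+
+```
+page cache footprint ≈ 32 MiB (configured) + 2000 dimensions x 4096 x 2 bytes ≈ 32 MiB + 16 MiB = 48 MiB
+```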
 
-You can use our [database engine
-calculator](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics)
+You can use
+our [database engine calculator](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics)
 to validate the memory requirements for your particular system(s) and configuration (**out-of-date**).
 
-### Disk space requirements
+### Disk space
 
 There are explicit disk space requirements **per** DB engine **instance**:
 
--   The total disk space footprint will be the maximum between `#dimensions-being-collected x 4096 x 2` bytes or what
-    the user configured with `dbengine multihost disk space` or `dbengine disk space`.
+- The total disk space footprint will be the maximum of `#dimensions-being-collected x 4096 x 2` bytes and what the
+  user configured with `dbengine multihost disk space` or `dbengine disk space`.
 
-### File descriptor requirements
+### File descriptors
 
-The Database Engine may keep a **significant** amount of files open per instance (e.g. per streaming child or
-parent server). When configuring your system you should make sure there are at least 50 file descriptors available per
+The Database Engine may keep a **significant** number of files open per instance (e.g. per streaming child or parent
+server). When configuring your system you should make sure there are at least 50 file descriptors available per
 `dbengine` instance.
 
 Netdata allocates 25% of the available file descriptors to its Database Engine instances. This means that only 25% of
@@ -148,7 +191,7 @@ ulimit -n 65536
 ```
 
 at the beginning of the service file. Alternatively you can change the system-wide limits of the kernel by changing
- `/etc/sysctl.conf`. For linux that would be:
+`/etc/sysctl.conf`. For Linux that would be:
 
 ```conf
 fs.file-max = 65536
@@ -165,8 +208,8 @@ You can apply the settings by running `sysctl -p` or by rebooting.
 
 ## Files
 
-With the DB engine mode the metric data are stored in database files. These files are organized in pairs, the
-datafiles and their corresponding journalfiles, e.g.:
+With the DB engine mode the metric data are stored in database files. These files are organized in pairs, the datafiles
+and their corresponding journalfiles, e.g.:
 
 ```sh
 datafile-1-0000000001.ndf
@@ -191,15 +234,16 @@ storage at lower granularity.
 The DB engine stores chart metric values in 4096-byte pages in memory. Each chart dimension gets its own page to store
 consecutive values generated from the data collectors. Those pages comprise the **Page Cache**.
 
-When those pages fill up they are slowly compressed and flushed to disk. It can take `4096 / 4 = 1024 seconds = 17
-minutes`, for a chart dimension that is being collected every 1 second, to fill a page. Pages can be cut short when we
-stop Netdata or the DB engine instance so as to not lose the data. When we query the DB engine for data we trigger disk
-read I/O requests that fill the Page Cache with the requested pages and potentially evict cold (not recently used)
-pages. 
+When those pages fill up, they are slowly compressed and flushed to disk. It can
+take `4096 / 4 = 1024 seconds = 17 minutes`, for a chart dimension that is being collected every 1 second, to fill a
+page. Pages can be cut short when we stop Netdata or the DB engine instance so as to not lose the data. When we query
+the DB engine for data we trigger disk read I/O requests that fill the Page Cache with the requested pages and
+potentially evict cold (not recently used)
+pages.
 
 When the disk quota is exceeded the oldest values are removed from the DB engine at real time, by automatically deleting
 the oldest datafile and journalfile pair. Any corresponding pages residing in the Page Cache will also be invalidated
-and removed. The DB engine logic will try to maintain between 10 and 20 file pairs at any point in time. 
+and removed. The DB engine logic will try to maintain between 10 and 20 file pairs at any point in time.
 
 The Database Engine uses direct I/O to avoid polluting the OS filesystem caches and does not generate excessive I/O
 traffic so as to create the minimum possible interference with other applications.
@@ -214,19 +258,19 @@ Constellation ES.3 2TB magnetic HDD and a SAMSUNG MZQLB960HAJR-00007 960GB NAND
 For our workload, we defined 32 charts with 128 metrics each, giving us a total of 4096 metrics. We defined 1 worker
 thread per chart (32 threads) that generates new data points with a data generation interval of 1 second. The time axis
 of the time-series is emulated and accelerated so that the worker threads can generate as many data points as possible
-without delays. 
+without delays.
 
-We also defined 32 worker threads that perform queries on random metrics with semi-random time ranges. The
-starting time of the query is randomly selected between the beginning of the time-series and the time of the latest data
-point. The ending time is randomly selected between 1 second and 1 hour after the starting time. The pseudo-random
-numbers are generated with a uniform distribution.
+We also defined 32 worker threads that perform queries on random metrics with semi-random time ranges. The starting time
+of the query is randomly selected between the beginning of the time-series and the time of the latest data point. The
+ending time is randomly selected between 1 second and 1 hour after the starting time. The pseudo-random numbers are
+generated with a uniform distribution.
 
 The data are written to the database at the same time as they are read from it. This is a concurrent read/write mixed
-workload with a duration of 60 seconds. The faster `dbengine` runs, the bigger the dataset size becomes since more
-data points will be generated. We set a page cache size of 64MiB for the two disk-bound scenarios. This way, the dataset
-size of the metric data is much bigger than the RAM that is being used for caching so as to trigger I/O requests most
-of the time. In our final scenario, we set the page cache size to 16 GiB. That way, the dataset fits in the page cache
-so as to avoid all disk bottlenecks.
+workload with a duration of 60 seconds. The faster `dbengine` runs, the bigger the dataset size becomes since more data
+points will be generated. We set a page cache size of 64 MiB for the two disk-bound scenarios. This way, the dataset size
+of the metric data is much bigger than the RAM that is being used for caching so as to trigger I/O requests most of the
+time. In our final scenario, we set the page cache size to 16 GiB. That way, the dataset fits in the page cache so as to
+avoid all disk bottlenecks.
 
 The reported numbers are the following:
 
@@ -237,15 +281,15 @@ The reported numbers are the following:
 |  N/A   |   16 GiB   | 6.8 GiB |    118.2M |      30.2M |
 
 where "reads/sec" is the number of metric data points being read from the database via its API per second and
-"writes/sec" is the number of metric data points being written to the database per second. 
+"writes/sec" is the number of metric data points being written to the database per second.
 
 Notice that the HDD numbers are pretty high and not much slower than the SSD numbers. This is thanks to the database
 engine design being optimized for rotating media. In the database engine, disk I/O requests are:
 
--   asynchronous to mask the high I/O latency of HDDs.
--   mostly large to reduce the amount of HDD seeking time.
--   mostly sequential to reduce the amount of HDD seeking time.
--   compressed to reduce the amount of required throughput.
+- asynchronous to mask the high I/O latency of HDDs.
+- mostly large to reduce the amount of HDD seeking time.
+- mostly sequential to reduce the amount of HDD seeking time.
+- compressed to reduce the amount of required throughput.
 
 As a result, the HDD is not thousands of times slower than the SSD, which is typical for other workloads.
 

+ 59 - 36
docs/store/change-metrics-storage.md

@@ -6,72 +6,95 @@ custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/store/chang
 
 # Change how long Netdata stores metrics
 
-import { Calculator } from '../../src/components/agent/dbCalc/'
+The Netdata Agent uses a custom-made time-series database (TSDB), named the [`dbengine`](/database/engine/README.md), to
+store metrics.
 
-The Netdata Agent uses a time-series database (TSDB), named the [database engine
-(`dbengine`)](/database/engine/README.md), to store metrics data. The most recently-collected metrics are stored in RAM,
-and when metrics reach a certain age, and based on how much system RAM you allocate toward storing metrics in memory,
-they are compressed and "spilled" to disk for long-term storage.
+The default settings retain approximately two days' worth of metrics on a system collecting 2,000 metrics every second,
+but the Netdata Agent is highly configurable if you want your nodes to store days, weeks, or months worth of per-second
+data.
 
-The default settings retain about two day's worth of metrics on a system collecting 2,000 metrics every second, but the
-Netdata Agent is highly configurable if you want your nodes to store days, weeks, or months worth of per-second data.
-
-The Netdata Agent uses two settings in `netdata.conf` to change the behavior of the database engine:
+The Netdata Agent uses the following three fundamental settings in `netdata.conf` to change the behavior of the database engine:
 
 ```conf
 [global]
-    page cache size = 32
+    dbengine page cache size = 32
     dbengine multihost disk space = 256
+    storage tiers = 1
 ```
 
-`page cache size` sets the maximum amount of RAM (in MiB) the database engine uses to cache and index recent metrics.
+`dbengine page cache size` sets the maximum amount of RAM (in MiB) the database engine uses to cache and index recent
+metrics.
 `dbengine multihost disk space` sets the maximum disk space (again, in MiB) the database engine uses to store
-historical, compressed metrics. When the size of stored metrics exceeds the allocated disk space, the database engine
-removes the oldest metrics on a rolling basis.
+historical, compressed metrics, and `storage tiers` specifies the number of storage tiers you want to have in
+your `dbengine`. When the size of stored metrics exceeds the allocated disk space, the database engine removes the
+oldest metrics on a rolling basis.
 
 ## Calculate the system resources (RAM, disk space) needed to store metrics
 
 You can store more or fewer metrics using the database engine by changing the allocated disk space. Use the calculator
-below to find an appropriate value for `dbengine multihost disk space` based on how many metrics your node(s) collect,
-whether you are streaming metrics to a parent node, and more.
+below to find the appropriate value for the `dbengine` based on how many metrics your node(s) collect, whether you are
+streaming metrics to a parent node, and more.
+
+You do not need to edit the `dbengine page cache size` setting to store more metrics using the database engine. However,
+if you want to store more metrics _specifically in memory_, you can increase the cache size.
+
+:::tip
+
+We advise you to visit the [tiering mechanism](/database/engine/README.md#tiering) reference, which will help you
+configure the Agent to retain metrics for longer periods.
 
-You do not need to edit the `page cache size` setting to store more metrics using the database engine. However, if you
-want to store more metrics _specifically in memory_, you can increase the cache size.
+:::
 
-> ⚠️ This calculator provides an estimate of disk and RAM usage for **metrics storage**, along with its best
-> recommendation for the `dbengine multihost disk space` setting. Real-life usage may vary based on the accuracy of the
-> values you enter below, changes in the compression ratio, and the types of metrics stored.
+:::caution
 
-<Calculator />
+This calculator provides an estimate of disk and RAM usage for **metrics storage**. Real-life usage may vary based on
+the accuracy of the values you enter, changes in the compression ratio, and the types of metrics stored.
+
+:::
+
+Download
+the [calculator](https://docs.google.com/spreadsheets/d/e/2PACX-1vTYMhUU90aOnIQ7qF6iIk6tXps57wmY9lxS6qDXznNJrzCKMDzxU3zkgh8Uv0xj_XqwFl3U6aHDZ6ag/pub?output=xlsx)
+to tune data retention to your preferences. Use the "Front" sheet and experiment with the variables highlighted in
+yellow until you find the best settings for your use case.
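+
+If you want a quick back-of-the-envelope estimate before opening the spreadsheet, the sketch below illustrates the kind
+of arithmetic involved. The `bytes_per_sample` figure is only a placeholder assumption; the real on-disk footprint
+depends on the compression ratio and the types of metrics stored, so treat the spreadsheet as the authoritative tool.
+
+```python
+# Back-of-the-envelope estimate of the disk space needed for metric storage.
+# `bytes_per_sample` is a placeholder assumption; real compression ratios vary.
+
+def disk_space_mib(concurrent_metrics: int,
+                   collection_interval_s: float,
+                   retention_days: float,
+                   bytes_per_sample: float = 1.0) -> float:
+    samples_per_metric = retention_days * 24 * 3600 / collection_interval_s
+    total_bytes = concurrent_metrics * samples_per_metric * bytes_per_sample
+    return total_bytes / (1024 * 1024)
+
+# Example: 2,000 metrics collected every second, kept for two days.
+print(f"~{disk_space_mib(2000, 1, 2):.0f} MiB")  # ~330 MiB with these assumptions
+```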
 
 ## Edit `netdata.conf` with recommended database engine settings
 
-Now that you have a recommended setting for `dbengine multihost disk space`, open `netdata.conf` with
-[`edit-config`](/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) and look for the `dbengine
-multihost disk space` setting. Change it to the value recommended above. For example:
+Now that you have recommended settings for your Agent's `dbengine`, open `netdata.conf` with
+[`edit-config`](/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) and look for the `[db]`
+section. Change its settings to the values you calculated with the calculator. For example:
 
 ```conf
-[global]
-    dbengine multihost disk space = 1024
+[db]
+   mode = dbengine
+   storage tiers = 3
+   update every = 1
+   dbengine multihost disk space MB = 1024
+   dbengine page cache size MB = 32
+   dbengine tier 1 update every iterations = 60
+   dbengine tier 1 multihost disk space MB = 384
+   dbengine tier 1 page cache size MB = 32
+   dbengine tier 2 update every iterations = 60
+   dbengine tier 2 multihost disk space MB = 16
+   dbengine tier 2 page cache size MB = 32
 ```
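+
+To see what the tier settings above mean in practice, here is a minimal sketch of the effective resolution of each
+tier, assuming every tier aggregates `update every iterations` points of the tier below it (see the tiering reference
+linked earlier):
+
+```python
+# Effective per-tier resolution with the example configuration above, assuming
+# each tier aggregates `update every iterations` points of the tier below it.
+
+update_every_s = 1          # [db] update every
+tier_iterations = [60, 60]  # tier 1 and tier 2 "update every iterations"
+
+resolution_s = update_every_s
+print(f"tier 0: one point every {resolution_s} second(s)")
+for tier, iterations in enumerate(tier_iterations, start=1):
+    resolution_s *= iterations
+    print(f"tier {tier}: one point every {resolution_s} second(s)")
+```
+
+With these values, tier 0 keeps per-second data, tier 1 one point per minute, and tier 2 one point per hour, so each
+tier can cover a much longer time window within its own disk space allocation.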
 
-Save the file and restart the Agent with `sudo systemctl restart netdata`, or the [appropriate
-method](/docs/configure/start-stop-restart.md) for your system, to change the database engine's size.
+Save the file and restart the Agent with `sudo systemctl restart netdata`, or
+the [appropriate method](/docs/configure/start-stop-restart.md) for your system, to change the database engine's size.
 
 ## What's next?
 
-If you have multiple nodes with the Netdata Agent installed, you can [stream
-metrics](/docs/metrics-storage-management/how-streaming-works.mdx) from any number of _child_ nodes to a _parent_ node
-and store metrics using a centralized time-series database. Streaming allows you to centralize your data, run Agents as
-headless collectors, replicate data, and more.
+If you have multiple nodes with the Netdata Agent installed, you
+can [stream metrics](/docs/metrics-storage-management/how-streaming-works.mdx) from any number of _child_ nodes to a
+_parent_ node and store metrics using a centralized time-series database. Streaming allows you to centralize your data,
+run Agents as headless collectors, replicate data, and more.
 
-Storing metrics with the database engine is completely interoperable with [exporting to other time-series
-databases](/docs/export/external-databases.md). With exporting, you can use the node's resources to surface metrics
-when [viewing dashboards](/docs/visualize/interact-dashboards-charts.md), while also archiving metrics elsewhere for
-further analysis, visualization, or correlation with other tools. 
+Storing metrics with the database engine is completely interoperable
+with [exporting to other time-series databases](/docs/export/external-databases.md). With exporting, you can use the
+node's resources to surface metrics when [viewing dashboards](/docs/visualize/interact-dashboards-charts.md), while also
+archiving metrics elsewhere for further analysis, visualization, or correlation with other tools.
 
 ### Related reference documentation
 
 - [Netdata Agent · Database engine](/database/engine/README.md)
+- [Netdata Agent · Database engine configuration options](/daemon/config/README.md#db-section-options)
 
 

+ 42 - 25
docs/store/distributed-data-architecture.md

@@ -10,34 +10,43 @@ Netdata uses a distributed data architecture to help you collect and store per-s
 Every node in your infrastructure, whether it's one or a thousand, stores the metrics it collects.
 
 Netdata Cloud bridges the gap between many distributed databases by _centralizing the interface_ you use to query and
-visualize your nodes' metrics. When you [look at charts in Netdata
-Cloud](/docs/visualize/interact-dashboards-charts.md), the metrics values are queried directly from that node's database
-and securely streamed to Netdata Cloud, which proxies them to your browser.
+visualize your nodes' metrics. When you
+[look at charts in Netdata Cloud](/docs/visualize/interact-dashboards-charts.md), the metrics values are queried
+directly from that node's database and securely streamed to Netdata Cloud, which proxies them to your browser.
 
 Netdata's distributed data architecture has a number of benefits:
 
--   **Performance**: Every query to a node's database takes only a few milliseconds to complete for responsiveness when
-    viewing dashboards or using features like [Metric
-    Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations).
--   **Scalability**: As your infrastructure scales, install the Netdata Agent on every new node to immediately add it to
-    your monitoring solution without adding cost or complexity.
--   **1-second granularity**: Without an expensive centralized data lake, you can store all of your nodes' per-second
-    metrics, for any period of time, while keeping costs down.
--   **No filtering or selecting of metrics**: Because Netdata's distributed data architecture allows you to store all
-    metrics, you don't have to configure which metrics you retain. Keep everything for full visibility during
-    troubleshooting and root cause analysis.
--   **Easy maintenance**: There is no centralized data lake to purchase, allocate, monitor, and update, removing
-    complexity from your monitoring infrastructure.
+- **Performance**: Every query to a node's database takes only a few milliseconds to complete for responsiveness when
+  viewing dashboards or using features
+  like [Metric Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations).
+- **Scalability**: As your infrastructure scales, install the Netdata Agent on every new node to immediately add it to
+  your monitoring solution without adding cost or complexity.
+- **1-second granularity**: Without an expensive centralized data lake, you can store all of your nodes' per-second
+  metrics, for any period of time, while keeping costs down.
+- **No filtering or selecting of metrics**: Because Netdata's distributed data architecture allows you to store all
+  metrics, you don't have to configure which metrics you retain. Keep everything for full visibility during
+  troubleshooting and root cause analysis.
+- **Easy maintenance**: There is no centralized data lake to purchase, allocate, monitor, and update, removing
+  complexity from your monitoring infrastructure.
 
-## Does Netdata Cloud store my metrics?
+## Ephemerality of metrics
 
-Netdata Cloud does not store metric values. 
+The ephemerality of metrics plays an important role in retention. In environments where metrics collection is dynamic
+and new metrics are constantly being generated, we are interested in two parameters:
 
-To enable certain features, such as [viewing active alarms](/docs/monitor/view-active-alarms.md) or [filtering by
-hostname/service](https://learn.netdata.cloud/docs/cloud/war-rooms#node-filter), Netdata Cloud does store configured
-alarms, their status, and a list of active collectors.
+1. The **expected concurrent number of metrics** as an average for the lifetime of the database. This affects mainly the
+   storage requirements.
 
-Netdata does not and never will sell your personal data or data about your deployment.
+2. The **expected total number of unique metrics** for the lifetime of the database. This affects mainly the memory
+   requirements for having all these metrics indexed and available to be queried.
+
+## Granularity of metrics
+
+The granularity of metrics (the frequency at which they are collected and stored, i.e. their resolution) significantly
+affects retention.
+
+Lowering the granularity from per second to every two seconds will double their retention and halve the CPU
+requirements of the Netdata Agent, without affecting disk space or memory requirements.
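+
+As a minimal sketch of why this is the case, assume a fixed disk budget that holds a given number of samples per
+metric; the retention window then scales linearly with the collection interval:
+
+```python
+# With a fixed disk budget, the number of stored samples per metric is roughly fixed,
+# so retention grows linearly with the collection interval.
+
+samples_per_metric = 1_000_000  # stand-in for whatever your disk budget holds
+
+for interval_s in (1, 2):
+    retention_days = samples_per_metric * interval_s / (24 * 3600)
+    print(f"collect every {interval_s}s -> ~{retention_days:.1f} days of retention")
+```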
 
 ## Long-term metrics storage with Netdata
 
@@ -47,7 +56,8 @@ appropriate amount of RAM and disk space.
 Read our document on changing [how long Netdata stores metrics](/docs/store/change-metrics-storage.md) on your nodes for
 details.
 
-## Other options for your metrics data
+You can also stream metrics between nodes using [streaming](/streaming/README.md), allowing you to replicate databases
+and create your own centralized data lake of metrics, if you choose to do so.
 
 While a distributed data architecture is the default when monitoring infrastructure with Netdata, you can also configure
 its behavior based on your needs or the type of infrastructure you manage.
@@ -55,12 +65,19 @@ its behavior based on your needs or the type of infrastructure you manage.
 To archive metrics to an external time-series database, such as InfluxDB, Graphite, OpenTSDB, Elasticsearch,
 TimescaleDB, and many others, see details on [integrating Netdata via exporting](/docs/export/external-databases.md).
 
-You can also stream between nodes using [streaming](/streaming/README.md), allowing to replicate databases and create
-your own centralized data lake of metrics, if you choose to do so.
-
 When you use the database engine to store your metrics, you can always perform a quick backup of a node's
 `/var/cache/netdata/dbengine/` folder using the tool of your choice.
 
+## Does Netdata Cloud store my metrics?
+
+Netdata Cloud does not store metric values.
+
+To enable certain features, such as [viewing active alarms](/docs/monitor/view-active-alarms.md)
+or [filtering by hostname/service](https://learn.netdata.cloud/docs/cloud/war-rooms#node-filter), Netdata Cloud does
+store configured alarms, their status, and a list of active collectors.
+
+Netdata does not and never will sell your personal data or data about your deployment.
+
 ## What's next?
 
 You can configure the Netdata Agent to store days, weeks, or months worth of distributed, per-second data by