
log2journal

log2journal and systemd-cat-native can be used to convert a structured log file, such as the ones generated by web servers, into systemd-journal entries.

By combining these tools with the usual UNIX shell utilities, you can create advanced log processing pipelines that send any kind of structured text logs to systemd-journald. This is a simple, yet powerful and efficient, way to handle log processing.

The process involves the usual piping of shell commands, to read and process the log files in real time.

The overall process looks like this:

tail -F /var/log/nginx/*.log       |\  # outputs log lines
  log2journal 'PATTERN'            |\  # outputs Journal Export Format
  systemd-cat-native                   # send to local/remote journald

Let's see the steps:

  1. tail -F /var/log/nginx/*.log
    This command tails all *.log files in /var/log/nginx/. We use -F instead of -f to ensure that files are still tailed after log rotation.
  2. log2journal is a Netdata program. It reads log entries and extracts fields according to the PCRE2 pattern given to it. It can also apply some basic operations on the fields, like injecting new fields, duplicating existing ones, or rewriting their values. The output of log2journal is in the systemd Journal Export Format, and it looks like this:

    KEY1=VALUE1 # << start of the first log line
    KEY2=VALUE2
               # << log lines separator
    KEY1=VALUE1 # << start of the second log line
    KEY2=VALUE2
    
  3. systemd-cat-native is a Netdata program. It can send the logs to a local systemd-journald (journal namespaces are supported), or to a remote systemd-journal-remote.

Real-life example

We have an nginx server logging in this format:

        log_format access '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '$request_length $request_time '
                    '"$http_referer" "$http_user_agent"';

First, let's find the right pattern for log2journal. We ask ChatGPT:

My nginx log uses this log format:

log_format access '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '$request_length $request_time '
                    '"$http_referer" "$http_user_agent"';

I want to use `log2journal` to convert this log for systemd-journal.
`log2journal` accepts a PCRE2 regular expression, using the named groups
in the pattern as the journal fields to extract from the logs.

Prefix all PCRE2 group names with `NGINX_` and use capital characters only. 

For the $request, use the field `MESSAGE` (without NGINX_ prefix), so that
it will appear in systemd journals as the message of the log.

Please give me the PCRE2 pattern.

ChatGPT replies with this:

^(?<NGINX_REMOTE_ADDR>[^ ]+) - (?<NGINX_REMOTE_USER>[^ ]+) \[(?<NGINX_TIME_LOCAL>[^\]]+)\] "(?<MESSAGE>[^"]+)" (?<NGINX_STATUS>\d+) (?<NGINX_BODY_BYTES_SENT>\d+) (?<NGINX_REQUEST_LENGTH>\d+) (?<NGINX_REQUEST_TIME>[\d.]+) "(?<NGINX_HTTP_REFERER>[^"]*)" "(?<NGINX_HTTP_USER_AGENT>[^"]*)"

Let's test it with a sample line (instead of tail):

# echo '1.2.3.4 - - [19/Nov/2023:00:24:43 +0000] "GET /index.html HTTP/1.1" 200 4172 104 0.001 "-" "Go-http-client/1.1"' | log2journal '^(?<NGINX_REMOTE_ADDR>[^ ]+) - (?<NGINX_REMOTE_USER>[^ ]+) \[(?<NGINX_TIME_LOCAL>[^\]]+)\] "(?<MESSAGE>[^"]+)" (?<NGINX_STATUS>\d+) (?<NGINX_BODY_BYTES_SENT>\d+) (?<NGINX_REQUEST_LENGTH>\d+) (?<NGINX_REQUEST_TIME>[\d.]+) "(?<NGINX_HTTP_REFERER>[^"]*)" "(?<NGINX_HTTP_USER_AGENT>[^"]*)"'
MESSAGE=GET /index.html HTTP/1.1
NGINX_BODY_BYTES_SENT=4172
NGINX_HTTP_REFERER=-
NGINX_HTTP_USER_AGENT=Go-http-client/1.1
NGINX_REMOTE_ADDR=1.2.3.4
NGINX_REMOTE_USER=-
NGINX_REQUEST_LENGTH=104
NGINX_REQUEST_TIME=0.001
NGINX_STATUS=200
NGINX_TIME_LOCAL=19/Nov/2023:00:24:43 +0000

As you can see, it extracted all the fields.

The MESSAGE, however, contains 3 fields by itself: the method, the URL and the protocol version. Let's ask ChatGPT to extract these too:

I see that the MESSAGE has 3 key items in it. The request method (GET, POST,
etc), the URL and HTTP protocol version.

I want to keep the MESSAGE as it is, with all the information in it, but also
extract the 3 items from it as separate fields.

Can this be done?

ChatGPT responded with this:

^(?<NGINX_REMOTE_ADDR>[^ ]+) - (?<NGINX_REMOTE_USER>[^ ]+) \[(?<NGINX_TIME_LOCAL>[^\]]+)\] "(?<MESSAGE>(?<NGINX_METHOD>[A-Z]+) (?<NGINX_URL>[^ ]+) HTTP/(?<NGINX_HTTP_VERSION>[^"]+))" (?<NGINX_STATUS>\d+) (?<NGINX_BODY_BYTES_SENT>\d+) (?<NGINX_REQUEST_LENGTH>\d+) (?<NGINX_REQUEST_TIME>[\d.]+) "(?<NGINX_HTTP_REFERER>[^"]*)" "(?<NGINX_HTTP_USER_AGENT>[^"]*)"

Let's test this too:

# echo '1.2.3.4 - - [19/Nov/2023:00:24:43 +0000] "GET /index.html HTTP/1.1" 200 4172 104 0.001 "-" "Go-http-client/1.1"' | log2journal '^(?<NGINX_REMOTE_ADDR>[^ ]+) - (?<NGINX_REMOTE_USER>[^ ]+) \[(?<NGINX_TIME_LOCAL>[^\]]+)\] "(?<MESSAGE>(?<NGINX_METHOD>[A-Z]+) (?<NGINX_URL>[^ ]+) HTTP/(?<NGINX_HTTP_VERSION>[^"]+))" (?<NGINX_STATUS>\d+) (?<NGINX_BODY_BYTES_SENT>\d+) (?<NGINX_REQUEST_LENGTH>\d+) (?<NGINX_REQUEST_TIME>[\d.]+) "(?<NGINX_HTTP_REFERER>[^"]*)" "(?<NGINX_HTTP_USER_AGENT>[^"]*)"'
MESSAGE=GET /index.html HTTP/1.1              # <<<<<<<<< MESSAGE
NGINX_BODY_BYTES_SENT=4172
NGINX_HTTP_REFERER=-
NGINX_HTTP_USER_AGENT=Go-http-client/1.1
NGINX_HTTP_VERSION=1.1                        # <<<<<<<<< VERSION
NGINX_METHOD=GET                              # <<<<<<<<< METHOD
NGINX_REMOTE_ADDR=1.2.3.4
NGINX_REMOTE_USER=-
NGINX_REQUEST_LENGTH=104
NGINX_REQUEST_TIME=0.001
NGINX_STATUS=200
NGINX_TIME_LOCAL=19/Nov/2023:00:24:43 +0000
NGINX_URL=/index.html                         # <<<<<<<<< URL

Ideally, we would want the 5xx errors to be red in our journalctl output. To achieve that we need to add a PRIORITY field to set the log level. Log priorities are numeric and follow the syslog priorities. Checking /usr/include/sys/syslog.h we can see these:

#define LOG_EMERG       0       /* system is unusable */
#define LOG_ALERT       1       /* action must be taken immediately */
#define LOG_CRIT        2       /* critical conditions */
#define LOG_ERR         3       /* error conditions */
#define LOG_WARNING     4       /* warning conditions */
#define LOG_NOTICE      5       /* normal but significant condition */
#define LOG_INFO        6       /* informational */
#define LOG_DEBUG       7       /* debug-level messages */

Avoid setting priority to 0 (LOG_EMERG), because these messages will appear on your terminals (the journal uses wall to let you know of such events). A good priority for errors is 3 (red in journalctl) or 4 (yellow in journalctl).
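The mapping we are about to build with --duplicate and --rewrite boils down to a simple rule. As a plain-shell sketch (status_to_priority is a hypothetical helper, not part of log2journal):

```shell
# Map an HTTP status code to a syslog priority:
# 5xx -> 3 (LOG_ERR, red in journalctl), everything else -> 6 (LOG_INFO).
status_to_priority() {
  case "$1" in
    5*) echo 3 ;;
    *)  echo 6 ;;
  esac
}

status_to_priority 503   # prints 3
status_to_priority 200   # prints 6
```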

To set the PRIORITY field in the output, we can use the NGINX_STATUS field. We need a copy of it, which we will alter later.

We can instruct log2journal to duplicate NGINX_STATUS, like this: log2journal --duplicate=PRIORITY=NGINX_STATUS. Let's try it:

# echo '1.2.3.4 - - [19/Nov/2023:00:24:43 +0000] "GET /index.html HTTP/1.1" 200 4172 104 0.001 "-" "Go-http-client/1.1"' | log2journal '^(?<NGINX_REMOTE_ADDR>[^ ]+) - (?<NGINX_REMOTE_USER>[^ ]+) \[(?<NGINX_TIME_LOCAL>[^\]]+)\] "(?<MESSAGE>(?<NGINX_METHOD>[A-Z]+) (?<NGINX_URL>[^ ]+) HTTP/(?<NGINX_HTTP_VERSION>[^"]+))" (?<NGINX_STATUS>\d+) (?<NGINX_BODY_BYTES_SENT>\d+) (?<NGINX_REQUEST_LENGTH>\d+) (?<NGINX_REQUEST_TIME>[\d.]+) "(?<NGINX_HTTP_REFERER>[^"]*)" "(?<NGINX_HTTP_USER_AGENT>[^"]*)"' --duplicate=PRIORITY=NGINX_STATUS
MESSAGE=GET /index.html HTTP/1.1
NGINX_BODY_BYTES_SENT=4172
NGINX_HTTP_REFERER=-
NGINX_HTTP_USER_AGENT=Go-http-client/1.1
NGINX_HTTP_VERSION=1.1
NGINX_METHOD=GET
NGINX_REMOTE_ADDR=1.2.3.4
NGINX_REMOTE_USER=-
NGINX_REQUEST_LENGTH=104
NGINX_REQUEST_TIME=0.001
NGINX_STATUS=200
PRIORITY=200                                 # <<<<<<<<< PRIORITY IS HERE
NGINX_TIME_LOCAL=19/Nov/2023:00:24:43 +0000
NGINX_URL=/index.html

Now that we have the PRIORITY field equal to the NGINX_STATUS, we can instruct log2journal to change it to a valid priority, by appending: --rewrite=PRIORITY=/^5/3 --rewrite=PRIORITY=/.*/6. These rewrite commands match everything that starts with 5 and replace it with priority 3 (error), and everything else with priority 6 (info). Let's see it:

# echo '1.2.3.4 - - [19/Nov/2023:00:24:43 +0000] "GET /index.html HTTP/1.1" 200 4172 104 0.001 "-" "Go-http-client/1.1"' | log2journal '^(?<NGINX_REMOTE_ADDR>[^ ]+) - (?<NGINX_REMOTE_USER>[^ ]+) \[(?<NGINX_TIME_LOCAL>[^\]]+)\] "(?<MESSAGE>(?<NGINX_METHOD>[A-Z]+) (?<NGINX_URL>[^ ]+) HTTP/(?<NGINX_HTTP_VERSION>[^"]+))" (?<NGINX_STATUS>\d+) (?<NGINX_BODY_BYTES_SENT>\d+) (?<NGINX_REQUEST_LENGTH>\d+) (?<NGINX_REQUEST_TIME>[\d.]+) "(?<NGINX_HTTP_REFERER>[^"]*)" "(?<NGINX_HTTP_USER_AGENT>[^"]*)"' --duplicate=PRIORITY=NGINX_STATUS --rewrite=PRIORITY=/^5/3 --rewrite=PRIORITY=/.*/6
MESSAGE=GET /index.html HTTP/1.1
NGINX_BODY_BYTES_SENT=4172
NGINX_HTTP_REFERER=-
NGINX_HTTP_USER_AGENT=Go-http-client/1.1
NGINX_HTTP_VERSION=1.1
NGINX_METHOD=GET
NGINX_REMOTE_ADDR=1.2.3.4
NGINX_REMOTE_USER=-
NGINX_REQUEST_LENGTH=104
NGINX_REQUEST_TIME=0.001
NGINX_STATUS=200
PRIORITY=6                                   # <<<<<<<<<< PRIORITY changed to 6
NGINX_TIME_LOCAL=19/Nov/2023:00:24:43 +0000
NGINX_URL=/index.html

Similarly, we could duplicate NGINX_URL to NGINX_ENDPOINT and then process it with sed to remove any query string, or replace IDs in the URL path with constant names, thus giving us uniform endpoints independent of the parameters.
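As a sketch of that idea, this pipeline normalizes a hypothetical NGINX_ENDPOINT value with sed, stripping the query string and replacing numeric path segments with a constant:

```shell
# Remove the query string, then collapse numeric path segments to /ID,
# so /users/1234/profile and /users/5678/profile become the same endpoint.
echo 'NGINX_ENDPOINT=/users/1234/profile?tab=posts' \
  | sed -e 's/?.*$//' -e 's|/[0-9][0-9]*|/ID|g'
# prints: NGINX_ENDPOINT=/users/ID/profile
```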

To complete the example, we can also inject a SYSLOG_IDENTIFIER with log2journal, using --inject=SYSLOG_IDENTIFIER=nginx-log, like this:

# echo '1.2.3.4 - - [19/Nov/2023:00:24:43 +0000] "GET /index.html HTTP/1.1" 200 4172 104 0.001 "-" "Go-http-client/1.1"' | log2journal '^(?<NGINX_REMOTE_ADDR>[^ ]+) - (?<NGINX_REMOTE_USER>[^ ]+) \[(?<NGINX_TIME_LOCAL>[^\]]+)\] "(?<MESSAGE>(?<NGINX_METHOD>[A-Z]+) (?<NGINX_URL>[^ ]+) HTTP/(?<NGINX_HTTP_VERSION>[^"]+))" (?<NGINX_STATUS>\d+) (?<NGINX_BODY_BYTES_SENT>\d+) (?<NGINX_REQUEST_LENGTH>\d+) (?<NGINX_REQUEST_TIME>[\d.]+) "(?<NGINX_HTTP_REFERER>[^"]*)" "(?<NGINX_HTTP_USER_AGENT>[^"]*)"' --duplicate=PRIORITY=NGINX_STATUS --inject=SYSLOG_IDENTIFIER=nginx-log --rewrite=PRIORITY=/^5/3 --rewrite=PRIORITY=/.*/6
MESSAGE=GET /index.html HTTP/1.1
NGINX_BODY_BYTES_SENT=4172
NGINX_HTTP_REFERER=-
NGINX_HTTP_USER_AGENT=Go-http-client/1.1
NGINX_HTTP_VERSION=1.1
NGINX_METHOD=GET
NGINX_REMOTE_ADDR=1.2.3.4
NGINX_REMOTE_USER=-
NGINX_REQUEST_LENGTH=104
NGINX_REQUEST_TIME=0.001
NGINX_STATUS=200
PRIORITY=6
NGINX_TIME_LOCAL=19/Nov/2023:00:24:43 +0000
NGINX_URL=/index.html
SYSLOG_IDENTIFIER=nginx-log               # <<<<<<<<< THIS HAS BEEN ADDED

Now the message is ready to be sent to a systemd-journal. For this we use systemd-cat-native. This command can send such messages to a journal running on the localhost, a local journal namespace, or a systemd-journal-remote running on another server. By just appending | systemd-cat-native to the command, the message will be sent to the local journal.

# echo '1.2.3.4 - - [19/Nov/2023:00:24:43 +0000] "GET /index.html HTTP/1.1" 200 4172 104 0.001 "-" "Go-http-client/1.1"' | log2journal '^(?<NGINX_REMOTE_ADDR>[^ ]+) - (?<NGINX_REMOTE_USER>[^ ]+) \[(?<NGINX_TIME_LOCAL>[^\]]+)\] "(?<MESSAGE>(?<NGINX_METHOD>[A-Z]+) (?<NGINX_URL>[^ ]+) HTTP/(?<NGINX_HTTP_VERSION>[^"]+))" (?<NGINX_STATUS>\d+) (?<NGINX_BODY_BYTES_SENT>\d+) (?<NGINX_REQUEST_LENGTH>\d+) (?<NGINX_REQUEST_TIME>[\d.]+) "(?<NGINX_HTTP_REFERER>[^"]*)" "(?<NGINX_HTTP_USER_AGENT>[^"]*)"' --duplicate=PRIORITY=NGINX_STATUS --inject=SYSLOG_IDENTIFIER=nginx-log --rewrite=PRIORITY=/^5/3 --rewrite=PRIORITY=/.*/6 | systemd-cat-native
# no output

# let's find the message
# journalctl -o verbose SYSLOG_IDENTIFIER=nginx-log
Sun 2023-11-19 04:34:06.583912 EET [s=1eb59e7934984104ab3b61f5d9648057;i=115b6d4;b=7282d89d2e6e4299969a6030302ff3e4;m=69b419673;t=60a783417ac72;x=2cec5dde8bf01ee7]
    PRIORITY=6
    _UID=0
    _GID=0
    _BOOT_ID=7282d89d2e6e4299969a6030302ff3e4
    _MACHINE_ID=6b72c55db4f9411dbbb80b70537bf3a8
    _HOSTNAME=costa-xps9500
    _RUNTIME_SCOPE=system
    _TRANSPORT=journal
    _CAP_EFFECTIVE=1ffffffffff
    _AUDIT_LOGINUID=1000
    _AUDIT_SESSION=1
    _SYSTEMD_CGROUP=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-59780d3d-a3ff-4a82-a6fe-8d17d2261106.scope
    _SYSTEMD_OWNER_UID=1000
    _SYSTEMD_UNIT=user@1000.service
    _SYSTEMD_USER_UNIT=vte-spawn-59780d3d-a3ff-4a82-a6fe-8d17d2261106.scope
    _SYSTEMD_SLICE=user-1000.slice
    _SYSTEMD_USER_SLICE=app-org.gnome.Terminal.slice
    _SYSTEMD_INVOCATION_ID=6195d8c4c6654481ac9a30e9a8622ba1
    _COMM=systemd-cat-nat
    MESSAGE=GET /index.html HTTP/1.1              # <<<<<<<<< CHECK
    NGINX_BODY_BYTES_SENT=4172                    # <<<<<<<<< CHECK
    NGINX_HTTP_REFERER=-                          # <<<<<<<<< CHECK
    NGINX_HTTP_USER_AGENT=Go-http-client/1.1      # <<<<<<<<< CHECK
    NGINX_HTTP_VERSION=1.1                        # <<<<<<<<< CHECK
    NGINX_METHOD=GET                              # <<<<<<<<< CHECK
    NGINX_REMOTE_ADDR=1.2.3.4                     # <<<<<<<<< CHECK
    NGINX_REMOTE_USER=-                           # <<<<<<<<< CHECK
    NGINX_REQUEST_LENGTH=104                      # <<<<<<<<< CHECK
    NGINX_REQUEST_TIME=0.001                      # <<<<<<<<< CHECK
    NGINX_STATUS=200                              # <<<<<<<<< CHECK
    NGINX_TIME_LOCAL=19/Nov/2023:00:24:43 +0000   # <<<<<<<<< CHECK
    NGINX_URL=/index.html                         # <<<<<<<<< CHECK
    SYSLOG_IDENTIFIER=nginx-log                   # <<<<<<<<< CHECK
    _PID=354312
    _SOURCE_REALTIME_TIMESTAMP=1700361246583912

So, the log line, with all its fields parsed, ended up in systemd-journal.

The complete example would look like the following script. Running this script with the parameter test will produce output on the terminal for you to inspect. Unmatched log entries are added to the journal with PRIORITY=1 (LOG_ALERT), so that you can spot them.

We also use the --filename-key option of log2journal, which detects the filename when tail switches output between files, and adds the field NGINX_LOG_FILE with the filename each log line comes from.

Finally, the script also adds the field NGINX_STATUS_FAMILY taking values 2xx, 3xx, etc, so that it is easy to find all the logs of a specific status family.
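The same family can be derived in plain shell; for a 3-digit status, stripping the last two characters leaves the family digit (a sketch, assuming valid 3-digit statuses):

```shell
# Derive the status family (2xx, 3xx, ...) from a 3-digit HTTP status
# by stripping the last two characters with parameter expansion:
status=404
echo "NGINX_STATUS_FAMILY=${status%??}xx"
# prints: NGINX_STATUS_FAMILY=4xx
```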

#!/usr/bin/env bash

test=0
last=0
send_or_show='./systemd-cat-native'
[ "${1}" = "test" ] && test=1 && last=100 && send_or_show=cat

pattern='(?x)                          # Enable PCRE2 extended mode
^
(?<NGINX_REMOTE_ADDR>[^ ]+) \s - \s    # NGINX_REMOTE_ADDR
(?<NGINX_REMOTE_USER>[^ ]+) \s         # NGINX_REMOTE_USER
\[
  (?<NGINX_TIME_LOCAL>[^\]]+)          # NGINX_TIME_LOCAL
\]
\s+ "
(?<MESSAGE>                            # MESSAGE
  (?<NGINX_METHOD>[A-Z]+) \s+          # NGINX_METHOD
  (?<NGINX_URL>[^ ]+) \s+              # NGINX_URL
  HTTP/(?<NGINX_HTTP_VERSION>[^"]+)    # NGINX_HTTP_VERSION
)
" \s+
(?<NGINX_STATUS>\d+) \s+               # NGINX_STATUS
(?<NGINX_BODY_BYTES_SENT>\d+) \s+      # NGINX_BODY_BYTES_SENT
"(?<NGINX_HTTP_REFERER>[^"]*)" \s+     # NGINX_HTTP_REFERER
"(?<NGINX_HTTP_USER_AGENT>[^"]*)"      # NGINX_HTTP_USER_AGENT
'

tail -n $last -F /var/log/nginx/*access.log \
	| log2journal "${pattern}" \
		--filename-key=NGINX_LOG_FILE \
		--duplicate=PRIORITY=NGINX_STATUS \
		--duplicate=NGINX_STATUS_FAMILY=NGINX_STATUS \
		--inject=SYSLOG_IDENTIFIER=nginx-log \
		--unmatched-key=MESSAGE \
		--inject-unmatched=PRIORITY=1 \
		--rewrite='PRIORITY=/^5/3' \
		--rewrite='PRIORITY=/.*/6' \
		--rewrite='NGINX_STATUS_FAMILY=/^(?<first_digit>[0-9]).*$/${first_digit}xx' \
		--rewrite='NGINX_STATUS_FAMILY=/^.*$/UNKNOWN' \
		| $send_or_show

log2journal options


Netdata log2journal v1.43.0-276-gfff8d1181

Convert structured log input to systemd Journal Export Format.

Using PCRE2 patterns, extract the fields from structured logs on the standard
input, and generate output according to systemd Journal Export Format.

Usage: ./log2journal [OPTIONS] PATTERN

Options:

  --file /path/to/file.yaml
       Read yaml configuration file for instructions.

  --show-config
       Show the configuration in yaml format before starting the job.
       This is also an easy way to convert command line parameters to yaml.

  --filename-key KEY
       Add a field with KEY as the key and the current filename as value.
       Automatically detects filenames when piped after 'tail -F'
       and tail is tailing multiple files.
       To inject the filename when tailing a single file, use --inject.

  --unmatched-key KEY
       Include unmatched log entries in the output with KEY as the field name.
       Use this to include unmatched entries in the output stream.
       Usually it should be set to --unmatched-key=MESSAGE so that the
       unmatched entry will appear as the log message in the journals.
       Use --inject-unmatched to inject additional fields to unmatched lines.

  --duplicate TARGET=KEY1[,KEY2[,KEY3[,...]]]
       Create a new key called TARGET, duplicating the values of the keys
       given. Useful for further processing. When multiple keys are given,
       their values are separated by comma.
       Up to 512 duplications can be given on the command line, and up to
       20 keys per duplication command are allowed.

  --inject LINE
       Inject constant fields to the output (both matched and unmatched logs).
       --inject entries are added to unmatched lines too, when their key is
       not used in --inject-unmatched (--inject-unmatched overrides --inject).
       Up to 512 fields can be injected.

  --inject-unmatched LINE
       Inject lines into the output for each unmatched log entry.
       Usually, --inject-unmatched=PRIORITY=3 is needed to mark the unmatched
       lines as errors, so that they can easily be spotted in the journals.
       Up to 512 such lines can be injected.

  --rewrite KEY=/SearchPattern/ReplacePattern
       Apply a rewrite rule to the values of a specific key.
       The first character after KEY= is the separator, which should also
       be used between the search pattern and the replacement pattern.
       The search pattern is a PCRE2 regular expression, and the replacement
       pattern supports literals and named capture groups from the search pattern.
       Example:
              --rewrite DATE=/^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/
                             ${day}/${month}/${year}
       This will rewrite dates in the format YYYY-MM-DD to DD/MM/YYYY.

       Only one rewrite rule is applied per key; the sequence of rewrites
       stops for a key once a rule matches it. This allows providing a
       sequence of independent rewrite rules for the same key, matching the
       different values the key may get, and also providing a catch-all rule
       at the end of the sequence to set the key's value if no other rule
       matched it.

       The combination of duplicating keys with the values of multiple other keys
       combined with multiple rewrite rules, allows creating complex rules for
       rewriting key values.

       Up to 512 rewriting rules are allowed.

  -h, --help
       Display this help and exit.

  PATTERN
       PATTERN should be a valid PCRE2 regular expression.
       RE2 regular expressions (like the ones usually used in Go applications),
       are usually valid PCRE2 patterns too.
       Regular expressions without named groups are ignored.

The program accepts all parameters as both --option=value and --option value.

The maximum line length accepted is 1048576 characters.
The maximum number of fields in the PCRE2 pattern is 1024.
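For quick experimentation outside of log2journal, the DATE rewrite shown above can be approximated with sed, using numbered groups (\1..\3) in place of the PCRE2 named groups:

```shell
# sed analogue of the DATE rewrite rule: \1=year, \2=month, \3=day.
echo 'DATE=2023-11-19' \
  | sed -E 's|^DATE=([0-9]{4})-([0-9]{2})-([0-9]{2})$|DATE=\3/\2/\1|'
# prints: DATE=19/11/2023
```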

PIPELINE AND SEQUENCE OF PROCESSING

This is a simple diagram of the pipeline taking place:

           +---------------------------------------------------+
           |                       INPUT                       |
           +---------------------------------------------------+
                            v                          v
           +---------------------------------+         |
           |   EXTRACT FIELDS AND VALUES     |         |
           +---------------------------------+         |
                  v                  v                 |
           +---------------+         |                 |
           |   DUPLICATE   |         |                 |
           | create fields |         |                 |
           |  with values  |         |                 |
           +---------------+         |                 |
                  v                  v                 v
           +---------------------------------+  +--------------+
           |         REWRITE PIPELINES       |  |    INJECT    |
           |        altering the values      |  |   constants  |
           +---------------------------------+  +--------------+
                             v                          v
           +---------------------------------------------------+
           |                       OUTPUT                      |
           +---------------------------------------------------+

JOURNAL FIELDS RULES (enforced by systemd-journald)

     - field names can be up to 64 characters
     - the only allowed field characters are A-Z, 0-9 and underscore
     - the first character of fields cannot be a digit
     - protected journal fields start with underscore:
       * they are accepted by systemd-journal-remote
       * they are NOT accepted by a local systemd-journald

     For best results, always include these fields:

      MESSAGE=TEXT
      The MESSAGE is the body of the log entry.
      This field is what we usually see in our logs.

      PRIORITY=NUMBER
      PRIORITY sets the severity of the log entry.
      0=emerg, 1=alert, 2=crit, 3=err, 4=warn, 5=notice, 6=info, 7=debug
      - Emergency events (0) are usually broadcast to all terminals.
      - Emergency, alert, critical, and error (0-3) are usually colored red.
      - Warning (4) entries are usually colored yellow.
      - Notice (5) entries are usually bold or have a brighter white color.
      - Info (6) entries are the default.
      - Debug (7) entries are usually grayed or dimmed.

      SYSLOG_IDENTIFIER=NAME
      SYSLOG_IDENTIFIER sets the name of the application.
      Use something descriptive, like: SYSLOG_IDENTIFIER=nginx-logs

You can find the most common fields at 'man systemd.journal-fields'.
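A quick sanity check for user-defined field names can be sketched in shell (valid_field is a hypothetical helper enforcing the rules above):

```shell
# Succeed when the argument is a valid user-defined journal field name:
# starts with A-Z, contains only A-Z, 0-9 and underscore, at most 64 chars.
valid_field() {
  printf '%s' "$1" | grep -Eq '^[A-Z][A-Z0-9_]{0,63}$'
}

valid_field NGINX_STATUS && echo valid      # prints: valid
valid_field 2FAST        || echo invalid    # prints: invalid
```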

Example YAML file:

--------------------------------------------------------------------------------
# Netdata log2journal Configuration Template
# The following parses nginx log files using the combined format.

# The PCRE2 pattern to match log entries and give names to the fields.
# The journal will have these names, so follow their rules. You can
# initiate an extended PCRE2 pattern by starting the pattern with (?x)
pattern: |
  (?x)                                   # Enable PCRE2 extended mode
  ^
  (?<NGINX_REMOTE_ADDR>[^ ]+) \s - \s    # NGINX_REMOTE_ADDR
  (?<NGINX_REMOTE_USER>[^ ]+) \s         # NGINX_REMOTE_USER
  \[
    (?<NGINX_TIME_LOCAL>[^\]]+)          # NGINX_TIME_LOCAL
  \]
  \s+ "
  (?<MESSAGE>
    (?<NGINX_METHOD>[A-Z]+) \s+          # NGINX_METHOD
    (?<NGINX_URL>[^ ]+) \s+              # NGINX_URL
    HTTP/(?<NGINX_HTTP_VERSION>[^"]+)    # NGINX_HTTP_VERSION
  )
  " \s+
  (?<NGINX_STATUS>\d+) \s+               # NGINX_STATUS
  (?<NGINX_BODY_BYTES_SENT>\d+) \s+      # NGINX_BODY_BYTES_SENT
  "(?<NGINX_HTTP_REFERER>[^"]*)" \s+     # NGINX_HTTP_REFERER
  "(?<NGINX_HTTP_USER_AGENT>[^"]*)"      # NGINX_HTTP_USER_AGENT

# When log2journal can detect the filename of each log entry (tail gives it
# only when it tails multiple files), this key will be used to send the
# filename to the journals.
filename:
  key: NGINX_LOG_FILENAME

# Duplicate fields under a different name. You can duplicate multiple fields
# to a new one and then use rewrite rules to change its value.
duplicate:

  # we insert the field PRIORITY as a copy of NGINX_STATUS.
  - key: PRIORITY
    values_of:
    - NGINX_STATUS

  # we insert the field NGINX_STATUS_FAMILY as a copy of NGINX_STATUS.
  - key: NGINX_STATUS_FAMILY
    values_of: 
    - NGINX_STATUS

# Inject constant fields into the journal logs.
inject:
  - key: SYSLOG_IDENTIFIER
    value: "nginx-log"

# Rewrite the value of fields (including the duplicated ones).
# The search pattern can have named groups, and the replace pattern can use
# them as ${name}.
rewrite:
  # PRIORITY is a duplicate of NGINX_STATUS
  # Valid PRIORITIES: 0=emerg, 1=alert, 2=crit, 3=error, 4=warn, 5=notice, 6=info, 7=debug
  - key: "PRIORITY"
    search: "^[123]"
    replace: 6

  - key: "PRIORITY"
    search: "^4"
    replace: 5

  - key: "PRIORITY"
    search: "^5"
    replace: 3

  - key: "PRIORITY"
    search: ".*"
    replace: 4
  
  # NGINX_STATUS_FAMILY is a duplicate of NGINX_STATUS
  - key: "NGINX_STATUS_FAMILY"
    search: "^(?<first_digit>[1-5])"
    replace: "${first_digit}xx"

  - key: "NGINX_STATUS_FAMILY"
    search: ".*"
    replace: "UNKNOWN"

# Control what to do when input logs do not match the main PCRE2 pattern.
unmatched:
  # The journal key to log the PCRE2 error message to.
  # Set this to MESSAGE, so that you can see the unmatched line in the log.
  key: MESSAGE
  
  # Inject static fields to the unmatched entries.
  # Set PRIORITY=1 (alert) to help you spot unmatched entries in the logs.
  inject:
   - key: PRIORITY
     value: 1

--------------------------------------------------------------------------------

systemd-cat-native options


Netdata systemd-cat-native v1.40.0-1214-gae733dd49

This program reads from its standard input, lines in the format:

KEY1=VALUE1\n
KEY2=VALUE2\n
KEYN=VALUEN\n
\n

and sends them to systemd-journal.

   - Binary journal fields are not accepted at its input
   - Binary journal fields can be generated after newline processing
   - Messages have to be separated by an empty line
   - Keys starting with underscore are not accepted (by journald)
   - All other rules imposed by systemd-journald apply (enforced by journald)
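For example, a minimal, hand-written message obeying these rules can be produced with printf (the field values are illustrative):

```shell
# A complete three-field log event in the format systemd-cat-native
# expects: KEY=VALUE lines terminated by an empty line.
printf 'MESSAGE=Hello from the shell\nPRIORITY=6\nSYSLOG_IDENTIFIER=demo\n\n'
```

Piping this into systemd-cat-native creates a single journal entry with those three fields.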

Usage:

   systemd-cat-native
          [--newline=STRING]
          [--log-as-netdata|-N]
          [--namespace=NAMESPACE] [--socket=PATH]
          [--url=URL [--key=FILENAME] [--cert=FILENAME] [--trust=FILENAME|all]]

The program has the following modes of logging:

  * Log to a local systemd-journald or stderr

    This is the default mode. If systemd-journald is available, logs will be
    sent to systemd, otherwise logs will be printed on stderr, using logfmt
    formatting. Options --socket and --namespace are available to configure
    the journal destination:

      --socket=PATH
        The path of a systemd-journald UNIX socket.
        The program will use the default systemd-journald socket when this
        option is not used.

      --namespace=NAMESPACE
        The name of a configured and running systemd-journald namespace.
        The program will produce the socket path based on its internal
        defaults, to send the messages to the systemd journal namespace.

  * Log as Netdata, enabled with --log-as-netdata or -N

    In this mode the program uses environment variables set by Netdata for
    the log destination. Only log fields defined by Netdata are accepted.
    If the environment variables expected by Netdata are not found, it
    falls back to stderr logging in logfmt format.

  * Log to a systemd-journal-remote TCP socket, enabled with --url=URL

    In this mode, the program will send logs directly to a remote systemd
    journal (systemd-journal-remote is expected at the destination).
    This mode is available even when the local system does not support
    systemd, or even when it is not Linux, allowing a remote Linux
    systemd-journald to become the logs database of the local system.

    Unfortunately systemd-journal-remote does not accept compressed
    data over the network, so the stream will be uncompressed.

      --url=URL
        The destination systemd-journal-remote address and port, similarly
        to what /etc/systemd/journal-upload.conf accepts.
        Usually it is in the form: https://ip.address:19532
        Both http and https URLs are accepted. When using https, the
        following additional options are accepted:

      --key=FILENAME
        The filename of the private key of the server.
        The default is: /etc/ssl/private/journal-upload.pem

      --cert=FILENAME
        The filename of the public key of the server.
        The default is: /etc/ssl/certs/journal-upload.pem

      --trust=FILENAME | all
        The filename of the trusted CA public key.
        The default is: /etc/ssl/ca/trusted.pem
        The keyword 'all' can be used to trust all CAs.

      --keep-trying
        Keep trying to send the message, if the remote journal is not there.

    NEWLINES PROCESSING
    systemd-journal log entries may have newlines in them. However, the
    Journal Export Format uses binary-formatted data to achieve this,
    making it hard to handle with text processing tools.

    To overcome this limitation, this program can convert single-line
    text values at its input into binary-formatted multi-line Journal
    Export Format fields at its output.

    To achieve that, it replaces a given string with a newline.
    The parameter --newline=STRING sets the string to be replaced
    with newlines.

    For example, by setting --newline='{NEWLINE}', the program will replace
    all occurrences of {NEWLINE} with the newline character, within each
    VALUE of the KEY=VALUE lines. Once this is done, the program will
    switch the field to the binary Journal Export Format before sending the
    log event to systemd-journal.
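The substitution itself (before the binary encoding step) can be emulated with GNU sed, which accepts \n in the replacement; this is only an illustration, not the program's actual code path:

```shell
# Replace the {NEWLINE} marker with a real newline, as systemd-cat-native
# does internally before binary-encoding the field (GNU sed assumed):
echo 'MESSAGE=first line{NEWLINE}second line' | sed 's/{NEWLINE}/\n/g'
```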