This article is based on the official “Loki Label Best Practices” guide, combined with lessons from xiaobai’s hands-on work; please bear with any mistakes.

1. Use static labels

Static labels keep Loki’s overhead low when ingesting logs. Labels are injected by the log collector before logs are sent to Loki, and commonly recommended static labels include the following (a Promtail configuration sketch follows the list):

  • Host: kubernetes/hosts
  • Application name: kubernetes/labels/app_kubernetes_io/name
  • Component name: kubernetes/labels/name
  • Namespace: kubernetes/namespace
  • Other kubernetes/labels/* static labels, such as environment, version, and so on
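
A minimal Promtail sketch of how these metadata fields might be attached via Kubernetes service discovery is shown below. The target label names (host, app, name, namespace) are illustrative choices, not anything mandated by Loki; adjust the relabeling to your own conventions.

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Node the pod is scheduled on
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: host
      # Application name from the app.kubernetes.io/name pod label
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        target_label: app
      # Component name from the "name" pod label
      - source_labels: [__meta_kubernetes_pod_label_name]
        target_label: name
      # Namespace
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace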

2. Use dynamic labels with caution

Too many label combinations create too many streams, which makes Loki store a large number of index entries and small object files, and that significantly degrades Loki’s query performance. To avoid these problems, don’t add a label until you know you need it! Loki’s strength is parallel querying: using filter expressions (|= "text", |~ "regex", …) to search the logs is usually more efficient and faster.

So, when should you add labels?

chunk_target_size defaults to 1MB, so Loki cuts chunks at roughly 1MB of compressed data, which corresponds to roughly 5MB of raw logs (depending on the compression you configured). If a log stream is large enough to produce one or more full compressed chunks within max_chunk_age, you may consider adding labels to split it into smaller streams. Starting with Loki 1.4.0, there is a metric that helps us understand why chunks are being flushed:

sum by (reason) (rate(loki_ingester_chunks_flushed_total{cluster="dev"}[1m]))

3. Keep label values within a bounded range

At the end of the day, if you do have to use dynamic labels, be careful to control both the range of label values and their length. For example, suppose you want to extract some fields from an nginx access log and store them in Loki:

{" @ timestamp ":" the 2020-09-30 T12:16:07 + 08:00, "@" source ":" 172.16.1.1 ", "hostname" : "node1", "IP" : "-", "client" : "172.16.2.1", "requ est_method":"GET","scheme":"https","domain":"xxx.com","referer":"-","request":"/api/v1/asset/asset?page_size=-1&group=23 ","args":"page_size=-1&group=23","size":975,"status": 200, "responsetime" : 0.065, "upstreamtime" : "0.064", "upstreamaddr" : "172.16.3.1:8080", "http_user_agent" : "python - requests / 2.22 .0","https":"on"}Copy the code

@source is the client’s source address. Since it can be any public address, its set of values is unbounded if used as a label. Likewise, request is the request URL, and some request parameters can be very long, making the label value very large. Multiply the two together and the resulting label cardinality is unacceptable.

This is a typical case of unbounded dynamic label values. In Loki this is described as cardinality: the higher the cardinality, the worse Loki’s query efficiency. The Loki community’s guidance is to keep the value range of a dynamic label to around 10.
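
To make this concrete, here is a rough Promtail pipeline sketch that keeps label values bounded for the nginx log above: only request_method and scheme (each well under 10 distinct values) are promoted to labels, while unbounded fields such as client, @source, and request stay in the log line and are searched with filter expressions. The field names assume the JSON format shown earlier.

pipeline_stages:
  # Parse the JSON log line and extract the two low-cardinality fields
  - json:
      expressions:
        request_method: request_method
        scheme: scheme
  # Promote only those bounded fields to Loki labels
  - labels:
      request_method:
      scheme: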

4. Dynamic labels applied by the client

Several of Loki’s clients (Promtail, Fluentd, Fluent Bit, the Docker plugin, etc.) provide ways to attach labels to log streams through configuration. Sometimes we need to identify which applications are sending dynamic labels to Loki, and the LogCLI tool can help. In Loki 1.6.0 and later, the logcli series command added the --analyze-labels flag specifically for debugging high-cardinality labels. For example:

$ logcli series --analyze-labels '{app="nginx"}'

Total Streams:  25017
Unique Labels:  8

Label Name  Unique Values  Found In Streams
requestId   24653          24979
logStream   1194           25016
logGroup    140            25016
accountId   13             25016
logger      1              25017
source      1              25016
transport   1              25017
format      1              25017

You can see that the requestId label has 24,653 unique values, which is very bad. We should drop requestId from the labels and query like this instead:

{app="nginx"} |= "requestId=1234567"

5. Configure the cache

For more information on Loki caching, see Xiaobai’s previous article “Using Cache to speed up Loki queries”.

Loki’s caching can be applied flexibly: you can share one common cache across all Loki components, or give each component its own cache, as described in the earlier article on deploying Loki in a distributed setup.
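
As a rough sketch only (the memcached address is a placeholder and cache options differ between Loki versions), a shared cache wired into the Loki config could look something like this:

query_range:
  cache_results: true
  results_cache:
    cache:
      # Query results cache backed by memcached (address is a placeholder)
      memcached_client:
        host: memcached.loki.svc.cluster.local
        service: memcached-client
chunk_store_config:
  chunk_cache_config:
    # Chunk cache can point at the same or a separate memcached
    memcached_client:
      host: memcached.loki.svc.cluster.local
      service: memcached-client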

6. The log time must be in ascending order

If a log entry arrives with a timestamp earlier than the latest log already received for that stream, the entry will be dropped:

{job= "syslog"} 00:00:00 I'm a syslog! {job= "syslog"} 00:00:02 I'm a syslog! {job= "syslog"} 00:00:01 I'm a syslog! This log will be deletedCopy the code

If your service runs across multiple nodes whose clocks are skewed, you will have to add a per-node label so that each node writes to its own stream and its logs can be stored:

{job= "syslog", instance= "host1"} 00:00:00 I'm a syslog! \\ New log flow 1 {job= "syslog", instance= "host1"} 00:00:02 I'm a syslog! {job= "syslog", instance= "host2"} 00:00:01 I'm a syslog! \\ New log flow 2 {job= "syslog", instance= "host1"} 00:00:03 I'm a syslog! \\ time order in log flow 1 {job= "syslog", instance= "host2"} 00:00:02 I'm a syslog! \\ Time order in log stream 2Copy the code

There is not much more to say here, except that it is recommended to stamp each log entry with the client’s collection time when gathering logs. If your timestamps are extracted from the application’s own logs and they are out of order, please fix your application first 😂

7. Use the chunk_target_size parameter

As mentioned above, chunk_target_size lets Loki compress log streams into chunks of a reasonable size, and Loki keeps one active chunk in memory per log stream. The more streams a log file is split into, the more chunks are held in memory, and in theory the greater the risk of losing logs before they are flushed to storage. In this case you need to combine max_chunk_age (1h) and chunk_idle_period (30m) to control how long Loki waits before flushing chunks.
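
Putting the three settings together, the relevant ingester block might be sketched like this; the values simply mirror the ones mentioned above and should be tuned to your own log volume:

ingester:
  chunk_target_size: 1048576    # aim for ~1MB of compressed data per chunk
  max_chunk_age: 1h             # flush a chunk after this long, even if it is not full
  chunk_idle_period: 30m        # flush a chunk that has received no new logs for this long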

8. Use the -print-config-stderr or -log-config-reverse-order flags

As of version 1.6.0, Loki and Promtail support these flags. When started with them, Loki dumps its entire effective configuration at startup, either to stderr or to its log output. This way we can quickly see the full Loki configuration, which makes debugging much easier.

When -log-config-reverse-order is enabled, the configuration entries are logged in reverse order, so when we browse Loki’s own logs in Grafana (which shows the newest entries first) the configuration reads top to bottom, which makes it a little easier.

9. Use the query-frontend

The query-frontend splits a log query into smaller queries and dispatches them to queriers for concurrent execution, which greatly improves Loki’s query efficiency. In theory you can scale out to hundreds of queriers to process gigabytes or terabytes of logs concurrently, provided your machines can handle it. 😃
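
As a rough sketch (the frontend address is a placeholder and option names vary somewhat between Loki versions), enabling query splitting and pointing queriers at a query-frontend might look like this:

query_range:
  split_queries_by_interval: 30m        # break a long time range into smaller sub-queries
  align_queries_with_step: true
frontend:
  compress_responses: true
frontend_worker:
  # Queriers pull split queries from the query-frontend at this address (placeholder)
  frontend_address: query-frontend.loki.svc.cluster.local:9095
  parallelism: 4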

About Cloud Native Xiaobai

Cloud Native Xiaobai was created to bring cloud native applications closer to everyone’s daily work from a practical point of view: looking at and using cloud native from a beginner’s perspective, and solving one practical problem with each article as a starting point into cloud native.