In Get Started with ClickHouse, we introduced the basics of using ClickHouse. In this article, you will learn how to use ClickHouse to quickly process semi-structured offline data, such as Nginx logs, so that this static data can be quickly queried and analyzed.

Preface

Web servers such as Nginx and databases such as MySQL produce a lot of operational data in their day-to-day running. Without special configuration, what we get is plain text in a certain format, generally called semi-structured content, for example:

xxx.xxx.xxx.xxx - - [01/Jan/2021:00:02:03 +0000] "GET /feed HTTP/1.1" 301 169 "-" "Winds: Open Source RSS & Podcast app: https://getstream.io/winds/"
xxx.xxx.xxx.xxx - - [01/Jan/2021:00:02:34 +0000] "GET /feed/ HTTP/1.1" 304 0 "-" "Tiny Tiny RSS/20.08 (5497a13) (http://tt-rss.org/)"
xxx.xxx.xxx.xxx - - [01/Jan/2021:00:02:54 +0000] "GET /feed/ HTTP/1.1" 304 0 "-" "Tiny Tiny RSS/20.11 (242aa6e41) (http://tt-rss.org/)"

This data is not fully structured and cannot be parsed and used by ClickHouse, making it impossible to do fast data analysis.

Existing solutions in the open source community

Several solutions have emerged in the open source community to address these issues. We have the following options:

  • Solution 1, Clicktail: an open source tool that structures semi-structured MySQL and Nginx logs and sends them directly to the ClickHouse client as a streaming process.
  • Solution 2, nginx-clickhouse: structures Nginx access logs and sends them to ClickHouse for processing, and exposes a metrics endpoint for Prometheus to scrape.
  • Solution 3, Seslog: a server compatible with the Nginx logging protocol that converts the data and writes it to ClickHouse.
  • Solution 4, Nginx 2 ClickHouse: a small tool written in C, based on string matching and replacement, that transforms Nginx logs so they are easy to import into ClickHouse manually.
  • Solution 5, rsyslog-nginx-clickhouse: strictly speaking this is not a tool but a document; it describes how to configure Nginx, convert the logs with Rsyslog, and then store them in ClickHouse.

These solutions can be grouped into the following three use cases:

  • I have existing data on hand and need to structure it first, then manually import it into ClickHouse: Solution 4.
  • I need to process Nginx data produced in real time: Solutions 1, 2, 3, and 5.
  • I want to handle semi-structured data not produced by Nginx, such as MySQL logs: only Solution 1.

Problems and highlights of the existing solutions

However, these solutions were all created for very specific scenarios and audiences, so they have some problems in the scenario described in this article (general-purpose, suitable for batch processing of offline data):

  • Solution 1, which looks the most general, has been abandoned by its developers. The code is old, some dependencies are missing, and customizing it would mean paying a refactoring cost.
  • Solutions 2, 3, and 5 require log replay, which is inefficient and complicated for offline data. Moreover, none of them can handle data other than Nginx logs, so their generality is poor.
  • Solution 4: because its program logic is fixed, the Nginx log fields must be configured according to its conventions; if they are not, data conversion and analysis cannot be carried out. It can also only handle Nginx logs, so it lacks generality.

Although none of the above solutions quite fits our scenario, each of them has its strengths:

  • Solutions 1 through 4 are written in languages with high execution performance (Golang and C), so the tools run faster than equivalents written in scripting languages.
  • Solution 4: its stateless, tool-style usage best matches our requirements; it suits offline batch processing and solves the parsing and importing of large logs.
  • Solution 5: the source logs can be configured flexibly and the approach adapts to many scenarios. The Rsyslog component liblognorm is well suited to act as the converter in our scenario.

Learning from existing tools

Having clarified the problems and strengths of the solutions above, we can put together a simple tool chain of our own. The preliminary idea:

  1. Use Linux commands such as cat, grep, awk, and uniq to quickly preprocess the data.
  2. Use lognormalizer, from the liblognorm component, to parse the logs and convert them to JSON.
  3. Write a small program that takes the data from the previous step and corrects values that don't match expectations, such as the time fields.
  4. Use the ClickHouse client to import the data directly, adjust the data fields, and then analyze them (a conceptual sketch of the whole pipeline follows this list).
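Conceptually, the four steps chain together with Linux pipes. The sketch below is only illustrative: the tool names (lognormalizer, datepatcher) and the target table are introduced in the steps that follow, and in practice the conversion and the import run in separate containers.

cat 2021.log \
  | lognormalizer -r parser.rule -e json \
  | datepatcher \
  | clickhouse-client --query="INSERT INTO soulteary.tmp FORMAT JSONEachRow"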

Step 1: Data preprocessing

The basic processing of logs with Linux commands was covered in the earlier article "Log analysis and processing: GoAccess", so it is not repeated here. Interested readers can browse the previous articles.
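As a quick illustration, a preprocessing pass might look something like the commands below. The file names and filter patterns are hypothetical; adjust them to your own logs.

# Merge rotated log files, drop obvious noise, and keep one year of data (hypothetical patterns)
cat logs/access.log* | grep -v "healthcheck" | grep "2021:" > logs/2021.log
# Peek at the result and count how many lines will go into the next step
head -n 3 logs/2021.log && wc -l logs/2021.log
# A quick look at the most active client IPs, using awk / sort / uniq
awk '{print $1}' logs/2021.log | sort | uniq -c | sort -rn | head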

Step 2: Use LogNormalizer to structure logs

I have uploaded the relevant code for this part to GitHub; help yourself if you need it.

Lognormalizer is part of Rsyslog's liblognorm component. With a rule configuration, it can quickly normalize all kinds of semi-structured content into the structured format you want.
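To get a feel for the rule syntax before we reach the real configuration later in this article, here is a toy rule (not the one we will actually use) that would parse lines such as "GET /index.html 200" into three named fields:

version=2

rule=:%verb:word% %request:word% %response:number%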

Unlike the more popular newcomers of recent years, upstream does not actively maintain packaged releases. As you can see from repology.org/project/lib…, the latest version is only available out of the box on Fedora, OpenBSD, and Alpine.

So if we want the best performance, we need to compile it ourselves. But since Alpine ships binaries of the latest version, using containers is perhaps the easiest and most pleasant option. Before introducing the container approach, let's briefly talk about the compilation process.

The installation section of the official documentation is somewhat dated and needs updating, for example using libfastjson to replace the json-c library, and using Rsyslog's build dependencies to set up the base build environment. Maybe I will port it to macOS later when I have time, for convenient daily use.
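For reference, building from source on a Debian-like system looks roughly like the following. The package names and steps are assumptions based on the upstream documentation rather than a verified recipe:

# Build dependencies (package names as on Debian/Ubuntu; adjust for your distribution)
apt-get install -y build-essential pkg-config autoconf automake libtool libestr-dev libfastjson-dev
# Fetch, build, and install liblognorm, which ships the lognormalizer binary
git clone https://github.com/rsyslog/liblognorm.git
cd liblognorm
autoreconf -fi && ./configure && make && make install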

To enjoy using the latest version of LogNormalizer, we need to prepare a container image.

FROM alpine:3.14

ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ENV LC_ALL en_US.UTF-8

RUN apk --no-cache --no-progress update && \
    apk --no-cache --no-progress upgrade

RUN apk add liblognorm

For future use, I have pushed this image to DockerHub. You can download and use it directly by using the following command:

docker pull soulteary/lognormalizer:2.0.6

To verify that LogNormalizer works, prepare a small amount of log data, write a simple parsing configuration, create an application container, and try to use it for data conversion.

We save the following as parser.rule (the configuration comes from the rsyslog-nginx-clickhouse solution):

version=2

rule=:%clientip:word% %ident:word% %auth:word% [%day:char-to:/%/%month:char-to:/%/%year:number%:%rtime:word% %tz:char-to:]%] "%verb:word% %request:word% HTTP/%httpversion:float%" %response:number% %bytes:number% "%referrer:char-to:"%" "%agent:char-to:"%"%blob:rest%

Then extract the first three lines of the log file as test data:

head -n 3 logs/2021.log > sample.txt

Next, start a test container and mount the configuration and test log data mentioned above:

docker run --rm -it -v `pwd`/sample.txt:/sample.txt -v `pwd`/parser.rule:/parser.rule soulteary/lognormalizer:2.0.6 sh

Manually load the data and pipe it to LogNormalizer for structured output. Set the output format to JSON:

cat sample.txt | lognormalizer -r parser.rule -e json

Not surprisingly, we should get output similar to the following:

{ "blob": ""."agent": "Tiny Tiny RSS \ / 20.08 5497 a13 \ / (http:\/\/tt-rss.org)"."referrer": "-"."bytes": "0"."response": "304"."httpversion": "1.1"."request": "\/feed\/"."verb": "GET"."tz": "+ 0000"."rtime": "00:02:34"."year": "2021"."month": "Jan"."day": "01"."auth": "-"."ident": "-"."clientip": "xxx.xxx.xxx.xxx"}...Copy the code

Doesn't the output above already look good enough? But on closer inspection, there is a problem with the month field: "month": "Jan", the default month format in Nginx logs, is not suitable for the structured queries we will run in ClickHouse later, so we need to go one step further and turn it into a number.

Step 3: Write a small tool to transform the data

For efficiency and simplicity, we can write a small tool in Golang to finish processing the data from step 2. For execution efficiency, we can also pass the intermediate data directly through a Linux pipeline, without it ever touching the disk.

A simple program that reads data through a pipe and replaces the twelve month name strings with numbers could look like this:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	monthLabels := []string{"\"Jan\"", "\"Feb\"", "\"Mar\"", "\"Apr\"", "\"May\"", "\"Jun\"", "\"Jul\"", "\"Aug\"", "\"Sep\"", "\"Oct\"", "\"Nov\"", "\"Dec\""}
	monthValues := []string{"\"01\"", "\"02\"", "\"03\"", "\"04\"", "\"05\"", "\"06\"", "\"07\"", "\"08\"", "\"09\"", "\"10\"", "\"11\"", "\"12\""}

	src, _ := bufio.NewReader(os.Stdin).ReadString('\n')
	for i, monthLabel := range monthLabels {
		if strings.Contains(src, monthLabel) {
			dest := strings.Replace(src, monthLabel, monthValues[i], -1)
			fmt.Println(dest)
			return
		}
	}
	fmt.Println("")
}

Let’s save the code as test.go and test it using the data from step 2 and a simple command:

echo '{ "blob": ""."agent": "Tiny Tiny RSS \ / 20.08 5497 a13 \ / (http:\/\/tt-rss.org)"."referrer": "-"."bytes": "0"."response": "304"."httpversion": "1.1"."request": "\/feed\/"."verb": "GET"."tz": "+ 0000"."rtime": "00:02:34"."year": "2021"."month": "Jan"."day": "01"."auth": "-"."ident": "-"."clientip": "xxx.xxx.xxx.xxx" }' | go run test.go 
Copy the code

After executing the command, we should see that the data has changed to what we expected, and the month has been converted to a number correctly:

{ "blob": ""."agent": "Tiny Tiny RSS \ / 20.08 5497 a13 \ / (http:\/\/tt-rss.org)"."referrer": "-"."bytes": "0"."response": "304"."httpversion": "1.1"."request": "\/feed\/"."verb": "GET"."tz": "+ 0000"."rtime": "00:02:34"."year": "2021"."month": "01"."day": "01"."auth": "-"."ident": "-"."clientip": "xxx.xxx.xxx.xxx" }
Copy the code

If you're familiar with Golang, you'll have noticed the problem with this example code: it can only process one line of data at a time. So let's make some simple tweaks to it:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	monthLabels := []string{"\"Jan\"", "\"Feb\"", "\"Mar\"", "\"Apr\"", "\"May\"", "\"Jun\"", "\"Jul\"", "\"Aug\"", "\"Sep\"", "\"Oct\"", "\"Nov\"", "\"Dec\""}
	monthValues := []string{"\"01\"", "\"02\"", "\"03\"", "\"04\"", "\"05\"", "\"06\"", "\"07\"", "\"08\"", "\"09\"", "\"10\"", "\"11\"", "\"12\""}

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		src := scanner.Text()
		for i, monthLabel := range monthLabels {
			if strings.Contains(src, monthLabel) {
				dest := strings.Replace(src, monthLabel, monthValues[i], -1)
				fmt.Println(dest)
				break
			}
		}
	}
}

With these few tweaks, the tool works well with Linux tools such as cat for streaming large files. Further optimizations are possible if you wish, such as splitting the work across worker goroutines. With the sample code above, processing a file just under 2 GB takes about 9 seconds; with some optimization it should be possible to bring that down to 1 to 3 seconds. Interested readers can give it a try; a rough sketch follows.
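The following is a rough, illustrative sketch of one way the worker-pool idea could look; it is not the datepatcher tool packaged in the image below. Note that the output order is not preserved, which does not matter for data that will be bulk-imported into ClickHouse:

package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
	"sync"
)

func main() {
	monthLabels := []string{"\"Jan\"", "\"Feb\"", "\"Mar\"", "\"Apr\"", "\"May\"", "\"Jun\"", "\"Jul\"", "\"Aug\"", "\"Sep\"", "\"Oct\"", "\"Nov\"", "\"Dec\""}
	monthValues := []string{"\"01\"", "\"02\"", "\"03\"", "\"04\"", "\"05\"", "\"06\"", "\"07\"", "\"08\"", "\"09\"", "\"10\"", "\"11\"", "\"12\""}

	lines := make(chan string, 1024)
	out := make(chan string, 1024)

	// Fan the month replacement out across one worker per CPU core.
	var workers sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		workers.Add(1)
		go func() {
			defer workers.Done()
			for src := range lines {
				for i, monthLabel := range monthLabels {
					if strings.Contains(src, monthLabel) {
						out <- strings.Replace(src, monthLabel, monthValues[i], -1)
						break
					}
				}
			}
		}()
	}

	// A single writer goroutine keeps stdout writes from interleaving.
	var writer sync.WaitGroup
	writer.Add(1)
	go func() {
		defer writer.Done()
		buf := bufio.NewWriter(os.Stdout)
		defer buf.Flush()
		for line := range out {
			fmt.Fprintln(buf, line)
		}
	}()

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		lines <- scanner.Text()
	}
	close(lines)
	workers.Wait()
	close(out)
	writer.Wait()
}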

Since this tool needs to be used with the LogNormalizer in step 2, we can reuse the image created in step 2 when building containers.

FROM soulteary/lognormalizer:2.0.6 AS builder
RUN apk add go
COPY datepatcher.go /
RUN go build datepatcher.go

FROM soulteary/lognormalizer:2.0.6
COPY --from=builder /datepatcher /bin/

For ease of use, I also uploaded the image to DockerHub:

docker pull soulteary/lognormalizer-with-datapatcher:2.0.6

I have also uploaded this part of the code to GitHub; adjust it to your actual needs. Once the conversion tool image is ready, we can write a container orchestration configuration that uses it:

version: "2"

services:

  converter:
    image: soulteary/lognormalizer-with-datapatcher:2.0.6
    container_name: lognormalizer
    command: tail -f /etc/hostname
    environment:
      - TZ=Asia/Shanghai
    volumes:
      - ./logs:/logs
    tmpfs:
      - /cache

Run docker-compose down && docker-compose up -d to start the container, place the logs we need to process and the parser.rule mentioned above in the logs directory, and then execute the commands to convert the logs.

Consider using tmpfs to speed up reads and writes, for example by first moving the data file into the tmpfs directory (at least 15 times faster):

docker exec -it lognormalizer sh -c "cp /logs/2021.log /cache/2021.log"

Adjust the following commands according to your file location, then start converting and wait for the calculation to complete.

docker exec -it lognormalizer sh -c "cat /cache/2021.log | lognormalizer -r /logs/parser.rule -e json | datepatcher > /cache/data.log"

Finally, use docker cp or move the converted file into our mounted directory so it is ready to be imported into ClickHouse.

To save space, you can also clean up the container we just created:

docker kill lognormalizer
docker rm lognormalizer

Step 4: Import ClickHouse

You can build a ClickHouse instance as described in Get Started with ClickHouse, or use a ClickHouse database in the cloud. We will use the container version as the example.
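If you do not have a running instance yet, a minimal single-node container can be started roughly as follows; the image and mount path here are assumptions, and the earlier article covers the full setup. The container name clickhouse matches the docker exec commands used below:

docker run -d --name clickhouse \
  --ulimit nofile=262144:262144 \
  -v `pwd`/data:/var/lib/clickhouse \
  yandex/clickhouse-server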

Use docker exec -it clickhouse clickhouse-client to access the ClickHouse interactive console and create a database:

CREATE DATABASE IF NOT EXISTS soulteary

Then create a temporary table to hold our log data.

CREATE TABLE IF NOT EXISTS soulteary.tmp
(
    `year` String,
    `month` String,
    `day` String,
    `rtime` String,
    `clientip` String,
    `ident` String,
    `auth` String,
    `verb` String,
    `request` String,
    `httpversion` String,
    `response` UInt16,
    `bytes` UInt64,
    `referrer` String,
    `agent` String,
    `blob` String
)
ENGINE = MergeTree
ORDER BY (year, month, day, rtime)

Some of you may wonder why you need to create a temporary table when the table structure looks perfect.

This is because the data converted in the previous steps lacks proper date and time fields. If we want to sort quickly or run statistical analysis along the time dimension, doing it on top of plain number-typed or text-typed fields is relatively inefficient, so we should generate these two fields programmatically. So what does the schema of the formal table look like?

CREATE TABLE IF NOT EXISTS soulteary.nginx
(
    `date` Date,
    `time` DateTime,
    `year` UInt16,
    `month` UInt16,
    `day` UInt16,
    `rtime` String,
    `clientip` String,
    `ident` String,
    `auth` String,
    `verb` String,
    `request` String,
    `httpversion` String,
    `response` UInt16,
    `bytes` UInt64,
    `referrer` String,
    `agent` String,
    `blob` String
)
ENGINE = MergeTree
ORDER BY (year, month, day, rtime)
SETTINGS index_granularity = 8192

As you can see, in the formal table we have changed some of the field types, in addition to adding the date and time fields.

Exit the interactive console, enter the ClickHouse container (for example with docker exec -it clickhouse sh), and run the following command to formally import the data produced in the previous three steps into ClickHouse.

cat /var/lib/clickhouse/data.log | clickhouse-client -n --query="SET input_format_skip_unknown_fields=1; INSERT INTO soulteary.tmp FORMAT JSONEachRow"

Wait a moment, run clickhouse-client to enter the ClickHouse interactive console again, and then execute the statement that migrates the data from the temporary table to the formal table and fills in the missing fields:

INSERT INTO soulteary.nginx SELECT
    toDateTime(concat(year, '-', month, '-', day, ' ', rtime)) AS fullDateLabel,
    toDate(fullDateLabel),
    year,
    month,
    day,
    rtime,
    clientip,
    ident,
    auth,
    verb,
    request,
    httpversion,
    response,
    bytes,
    referrer,
    agent,
    blob
FROM soulteary.tmp

When the command is executed, output similar to the following is displayed:

Query id: 43a1fcbc-1078-4add-b6f7-7c425085ca4b

Ok.

0 rows in set. Elapsed: 4.848 sec. Processed 3.71 million rows, 1.09 GB (764.84 thousand rows/s., 225.44 MB/s.)

Step 5: Check the data

After the migration finishes, we can run a query to confirm that no data was lost.

SELECT
    prod,
    tmp
FROM
(
    SELECT count() AS prod
    FROM soulteary.nginx
) AS table_prod,
(
    SELECT count() AS tmp
    FROM soulteary.tmp
) AS table_tmp

Query id: 2415f48b-39f2-4bcd-8c5b-0971e4ada6a3

┌────prod─┬─────tmp─┐
│ 3707776 │ 3707776 │
└─────────┴─────────┘

1 rows in set. Elapsed: 0.004 sec.

Then manually execute a specific query:

SELECT *
FROM soulteary.nginx
LIMIT 1

┌───────date─┬────────────────time─┬─year─┬─month─┬─day─┬─rtime────┬─clientip───────┬─ident─┬─auth─┬─verb─┬─request─┬─httpversion─┬─response─┬─bytes─┬─referrer────────────────────┬─agent──────────────────────────────────────────────┬─blob─┐
│ 2021-01-01 │ 2021-01-01 00:00:00 │ 2021 │     1 │   1 │ 00:00:24 │ 111.231.78.131 │ -     │ -    │ GET  │ /feed/  │ 1.1         │      200 │ 15130 │ https://soulteary.com/feed/ │ FreshRSS/1.16.3-dev (Linux; https://freshrss.org)  │      │
└────────────┴─────────────────────┴──────┴───────┴─────┴──────────┴────────────────┴───────┴──────┴──────┴─────────┴─────────────┴──────────┴───────┴─────────────────────────────┴─────────────────────────────────────────────────────┴──────┘

You can see that the data has been migrated correctly to the formal table, and that the previously missing date and time fields have been filled in correctly.
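With the data in place, time-based analysis becomes straightforward. For example, an illustrative query for daily request counts and traffic could look like this:

SELECT
    date,
    count() AS requests,
    sum(bytes) AS bytes_sent
FROM soulteary.nginx
GROUP BY date
ORDER BY date ASC
LIMIT 10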

Finally

This concludes the topic of how to use ClickHouse to handle offline semi-structured data.

Maybe in the next ClickHouse article, I’ll pick up where I left off with “Private Cloud Environments in Laptops: Monitoring” and talk about integrating Grafana for visualization.

–EOF


We have a small group of a few hundred people who like to tinker with things.

There are no ads; we chat about software, hardware, HomeLab, and programming problems, and occasionally share information about technical salons in the group.

Friends who like to tinker are welcome to scan the QR code to add me as a friend. (When adding, please note your real name, where you found me, and your purpose, otherwise the request will not be approved.)

About joining the tinkering group


If you found this content useful, feel free to share it with your friends. Thank you.


This article is published under the Attribution 4.0 International (CC BY 4.0) license.

Author: Su Yang

Created: November 11, 2021 | Word count: 11,408 words | Reading time: 23 minutes | Link to this article: soulteary.com/2021/11/11/…