Introduction: Amid the trends toward Kubernetes servitization, real-time log processing, and centralized log storage, Kubernetes log processing faces new challenges, including dynamic container collection, performance bottlenecks under heavy traffic, and log routing management.

This article introduces the "Logtail + Log Service + ecosystem" architecture. It describes the advantages of the Logtail client in Kubernetes log collection scenarios, and how Log Service, as a one-stop infrastructure, addresses the two strong requirements of real-time read/write and HTAP-style log analysis. The openness of Log Service data, combined with cloud products and the open source community, gives users rich choices for real-time computing, visualization, and collection.

Trends and challenges in Kubernetes log processing

Kubernetes serverless

Container technology promotes stack decoupling through layering, so that developers can focus more on their own applications and business scenarios. Kubernetes pushes this decoupling further, and one trend of containerization is that containers will increasingly run on serverless infrastructure.

When we talk about infrastructure, the cloud comes to mind first. Serverless Kubernetes offerings are currently available on AWS, Ali Cloud, and Azure. On serverless Kubernetes, we no longer care about clusters and machines; we simply declare the container image, CPU, memory, and external service mode to start an application.

As shown in the picture above, the left and right sides show classic Kubernetes and serverless Kubernetes respectively. Log collection becomes more complicated as we move from left to right:

  • A single Kubernetes node may run an order of magnitude more pods, each of which may need log or monitoring-metric collection, which means a larger log volume on each node.

  • A Kubernetes node may run more kinds of pods; collection sources become diversified, and the need for log management and tagging becomes more urgent.

The demand for real-time logs keeps growing

First of all, not all logs need real-time processing; "T+1" log delivery is still important in many cases. For example, a one-day delay may be sufficient for BI, and a one-hour delay may be acceptable for CTR estimation.

However, in some scenarios, second-level or even lower log latency is a prerequisite. The horizontal axis below, from left to right, illustrates the importance of data timeliness for decision making.

Here are two scenarios to illustrate the importance of real-time logging for decision making:

  • Alerting and incident handling. As anyone doing DevOps knows, the sooner we find and handle an online problem, the calmer we stay; the longer we wait, the more likely it is to escalate. Real-time logs help us quickly discover abnormal metrics in the system and trigger the emergency response process.

  • AIOps. There are already algorithms for log-based anomaly detection and trend prediction: by following how log patterns change over time, the distribution of each log type under normal and abnormal conditions can be learned. For an IT business system, given the relevant factors and variables, failures such as hard-disk faults can be modeled, and real-time logs enable early warning of fault events.

Centralized log storage

There are many log sources; common ones include files, database audit logs, and network packets. In addition, the same data may be consumed repeatedly and in multiple ways by different users (development, O&M, operations) and for different purposes (alerting, data cleansing, real-time search, batch computing).

Log data integration can be described as pipelines running from data sources to storage nodes and on to compute nodes. Overall, log processing is evolving from O(N^2) pipelines to O(N) pipelines: without a shared hub, each of N sources may need its own link to each of N consumers, whereas with a central hub every source and consumer connects only to the hub.

In the past, each kind of log was stored in its own way, collection and computing links could not be shared or reused, pipelines were complex, and data storage could be redundant. In current log data integration, a central hub simplifies the log architecture and improves storage utilization. This infrastructure-level hub matters because it must support real-time pub/sub, handle highly concurrent reads and writes, and provide massive storage capacity.

Evolution of the Kubernetes log collection scheme

The previous section summarized the trends in Kubernetes log processing; this section reviews several common log collection practices on Kubernetes.

Command line tool

The most basic way to view logs on a Kubernetes cluster is to run kubectl logs against a pod and read the stdout/stderr written by its containers.
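For example, two typical invocations (the pod and container names are placeholders):

```shell
kubectl logs -f my-pod -c my-container        # follow one container's stdout/stderr
kubectl logs --tail=100 --previous my-pod     # last 100 lines of the previous (crashed) instance
```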

This basic approach cannot meet further needs:

  • Only standard output is supported

  • Data cannot be persisted

  • Nothing can be done with the logs other than viewing them

Writing log files to the node's disk

The Docker engine redirects a container's stdout/stderr to a log driver, and the log driver can be configured to persist logs in various forms, for example as JSON files on local storage via the json-file driver.
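As an illustration, the json-file driver with rotation can be turned on in /etc/docker/daemon.json; the size and file-count values here are arbitrary examples:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```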

Compared with the kubectl command line, this goes one step further by storing logs locally, so Linux tools such as grep and awk can be used to analyze the log file contents.

This is essentially a return to the physical-machine era, and many problems remain unsolved:

  • Being based on the Docker engine and its log driver, it supports nothing beyond the default standard output; application log files are not covered

  • Log data is lost once the file has been rotated several times or the Kubernetes node is evicted or decommissioned

  • It cannot integrate with open source or cloud data analytics tools

An evolution of this solution is to deploy a log collection client on each node and upload logs to a centralized log storage facility. This is currently the recommended pattern and is covered in the next section.

Sidecar mode

Sidecar collection is a companion mode: in addition to the business container, the pod contains a logging client container. This logging client container is responsible for collecting the stdout, files, and metric data of the pod's containers and reporting them to the server.
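A minimal sketch of this layout, assuming a generic agent image and an illustrative log path (both placeholders rather than a recommendation for a specific client):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-logging-sidecar
spec:
  containers:
  - name: app                       # the business container
    image: my-app:1.0               # hypothetical application image
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app       # the application writes its log files here
  - name: log-agent                 # the sidecar logging client
    image: example/log-agent:latest # placeholder image for any logging client
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true                # the sidecar only reads the shared log directory
  volumes:
  - name: app-logs
    emptyDir: {}                    # volume shared between app and sidecar
```

Every pod carries its own copy of the agent, which is exactly what the two points below refer to.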

This solution addresses basic functional requirements such as persistent log storage, but there are two areas for improvement:

  • If N pods run on a node, N logging clients run at the same time, wasting CPU, memory, ports, and other resources.

  • In Kubernetes, the collection configuration (log directories, collection rules, storage targets, and so on) has to be set up separately for each pod, which is hard to maintain.

Direct log writing

The direct-write scheme is usually implemented by modifying the application itself: log entries are buffered inside the program and then sent to the log storage backend through something like an HTTP API.

The benefit is that the log format can be customized on demand, and the routing from log source to destination can be configured freely.

The limitations are also easy to see:

  • Hacking the code depends on modifying the business application, and driving such changes usually takes a long time.

  • When the application hits an exception while sending data to the remote end (network jitter, an internal error on the receiving server), it has to buffer the data in limited memory and retry, so data loss can ultimately occur. A sketch of this pattern follows below.
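A minimal sketch of such a direct writer, assuming a hypothetical HTTP ingestion endpoint (https://logs.example.com/ingest is not a real API); the bounded in-memory buffer is what makes data loss possible when failures persist:

```python
import json
import time
import urllib.request
from collections import deque

ENDPOINT = "https://logs.example.com/ingest"   # hypothetical log ingestion API
MAX_BUFFER = 10_000                            # bounded in-memory buffer

buffer = deque(maxlen=MAX_BUFFER)              # entries are silently evicted once capacity is exceeded

def log(entry: dict) -> None:
    """Queue one log entry; the oldest entry is dropped if the buffer is full."""
    buffer.append(entry)

def flush(batch_size: int = 100, retries: int = 3) -> None:
    """Send buffered entries in batches; on repeated failure the batch is re-queued."""
    while buffer:
        batch = [buffer.popleft() for _ in range(min(batch_size, len(buffer)))]
        body = json.dumps(batch).encode("utf-8")
        req = urllib.request.Request(ENDPOINT, data=body,
                                     headers={"Content-Type": "application/json"})
        for attempt in range(retries):
            try:
                urllib.request.urlopen(req, timeout=5)
                break                              # batch delivered
            except OSError:
                time.sleep(2 ** attempt)           # back off and retry
        else:
            # all retries failed: re-queue the batch; if the buffer is already at
            # capacity, entries are evicted anyway, i.e. data loss
            buffer.extendleft(reversed(batch))
            return
```

log() would be called from the application's logging path, and flush() periodically or from a background thread.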

Kubernetes log processing architecture

Architecture from the community

In the most common architecture, collection is done by installing a logging client on every Kubernetes node:

  • Kubernetes officially recommends Treasure Data’s open source Fluentd, which offers balanced performance and plug-in richness.

  • The community also uses Elastic's open source Beats series, which has decent performance but somewhat fewer plugins; a dedicated client has to be deployed per data type, for example Filebeat for text files.

  • Some architectures use Logstash; its ETL support is very rich, but the JRuby implementation leads to relatively poor performance.

The logging client formats the data and uploads it over a specified protocol to a storage backend such as Kafka. Kafka supports real-time subscription and repeated consumption, and data can be synchronized to other systems according to business needs: business logs can go into Elasticsearch for keyword search and Kibana for visualization, and logs in financial scenarios that require long-term retention can be shipped from Kafka to cost-effective storage such as AWS S3.

This architecture looks clean and efficient, but on Kubernetes some details still need work:

  • First, this is a generic node-level collection scheme. Deploying Fluentd or similar clients and managing their collection configuration on Kubernetes is not easy, and there is no targeted optimization for log collection routing, data tagging, or client configuration.

  • Second, on the consumption side, Kafka's ecosystem is rich enough, but it still requires professional maintenance: capacity planning, watching machine utilization, and handling hardware failures. To query and analyze logs, you also need a good understanding of Elasticsearch. At the PB scale, the performance and operations problems of these distributed systems become prominent, and mastering such open source systems takes strong expertise.

Kubernetes logging architecture practice with Log Service

We propose a Kubernetes log processing architecture based on Ali Cloud Log Service to complement the community's scheme and to address some of the detailed pain points of log processing on Kubernetes. The architecture can be summed up as "Logtail + Log Service + ecosystem".

First, Logtail is the data collection client of Log Service, designed around the pain points of the Kubernetes scenario. Following the pattern recommended by Kubernetes, a single Logtail client is deployed on each node to collect the logs of all pods on that node.

Second, to meet the two basic requirements of keyword search and SQL statistics, Log Service provides the LogHub capability for real-time data writing and subscription. On top of LogHub storage, you can enable indexing and analysis: once the index is enabled, logs can be queried by keyword and analyzed with SQL syntax.

Finally, Log Service data is open. Indexed data can be exposed to third-party systems over the JDBC protocol, so SQL query results integrate easily with visualization systems such as Ali Cloud DataV and the open source community's Grafana. The high-throughput real-time read/write capability of Log Service also connects it to stream computing: Spark Streaming, Blink, JStorm, and other streaming systems have connector support.

Users can also use the fully managed shipping function to write data to Ali Cloud OSS object storage, in row-oriented (CSV, JSON) or column-oriented (Parquet) formats. This data can serve as low-cost long-term backup, or a data warehouse can be built on the "OSS storage + E-MapReduce computing" architecture.

Advantages of Log Service

The characteristics of Log Service can be described in four points:

  • In terms of reliability and stability, it has supported Alibaba Group and Ant Financial through several Double 11 and Double 12 promotions.

  • In terms of functionality, it covers most of the scenarios addressed by Kafka + Elasticsearch.

  • As cloud infrastructure, it scales elastically; during big promotions, users do not need to worry about buying machines to expand capacity or repairing dozens of broken disks every day.

  • Log Service also has the cloud's zero-upfront-cost, pay-as-you-go model, and currently includes a free quota of 500 MB per month.

Reviewing the trends and challenges of Kubernetes log processing described in the first section, we can summarize three advantages of Log Service:

  • As a log infrastructure, it solves the problem of centralized storage of log data.

  • Being delivered as a service makes it easier to use, which is also consistent with the serverless direction of Kubernetes.

  • It also meets real-time read/write and HTAP requirements, simplifying the log processing flow and architecture.

Analyzing Kubernetes logs with Log Service and the community

Kubernetes comes from the community, and using open source software for Kubernetes log processing is also the right choice in some scenarios.

Log Service keeps its data open and connects with the open source community for collection, computing, visualization, and more, so users can benefit from the community's technical achievements.

Here is a simple example: Flink consumes a Log Service logstore in real time, with the shards of the source logstore dynamically load-balanced across Flink tasks. After the stream is joined with metadata from MySQL, the result is written through the connector to another Log Service logstore for visual queries.

Logtail design for the Kubernetes log collection scenario

The second section of this article reviewed the problems encountered as Kubernetes log collection schemes evolved, and the third section introduced the functionality and ecosystem of Ali Cloud Log Service.

This section focuses on how Logtail is designed and optimized to address the pain points of Kubernetes log collection.

Difficulties of log collection on Kubernetes

  • Diversity of collection targets

    - Container stdout/stderr
    - Container application log files
    - Host log files
    - Open protocols such as syslog and HTTP

  • Collection reliability

    - Performance must hold up under heavy traffic on a single node, with real-time collection
    - The ephemeral nature of container logs must be handled (logs vanish when a container is removed)
    - Data completeness must be preserved as far as possible under all circumstances

  • Challenges of dynamic scaling

    - Containers that scale out or in must be discovered automatically
    - Deployment on Kubernetes should not become more complex

  • Ease of use of collection configuration

    - How collection configurations are deployed and managed
    - Pod logs for different purposes need to go to different stores, so data routing must be manageable

Highly reliable collection with Logtail

Logtail provides an at-least-once collection guarantee. Checkpoints at both the file and memory levels allow collection to resume where it left off when a container restarts.

During collection, various errors can arise from the system or from user configuration; for example, parsing rules need to be adjusted promptly when log formatting or parsing fails. Logtail provides collection monitoring that reports exceptions and statistics to a logstore, and supports querying and alerting on them.

Computing performance is optimized to handle large log volumes on a single node: without field parsing (single-line mode), Logtail processes roughly 100 MB/s per CPU core. To cope with slow network I/O, the client batches multiple logs into one commit to the server, achieving both real-time collection and high throughput.

Within Alibaba Group, Logtail currently runs stably on millions of client deployments.

Rich data source support

To meet the complex and diverse collection requirements of Kubernetes environments, Logtail can collect from the following sources: stdout/stderr, container and host log files, and open protocols such as syslog and Lumberjack.

A log line can be semantically split into multiple fields, yielding key-value pairs that map the log onto a table model and make later analysis more efficient. Logtail supports the following log formatting methods:

  • Multi-line parsing. A Java stack trace, for example, spans multiple physical lines; by configuring a line-begin regular expression, the stream is split into logical log entries (see the example after this list).

  • Self-describing formats. CSV and JSON are supported, with log fields extracted automatically.

  • More specific requirements can be met with regular expressions or custom plug-ins.

  • Built-in parsing rules are provided for typical logs. For example, a user can simply select the Nginx access log category in the web console, and Logtail will automatically extract client_ip, uri, and other fields from the access log based on Nginx's log format configuration.
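To make the multi-line case concrete, the fabricated Java exception below spans several physical lines; a line-begin regular expression such as ^\d{4}-\d{2}-\d{2} marks where each logical entry starts, so all of these lines are grouped into a single log:

```
2018-10-11 17:21:06,807 ERROR [http-nio-8080-exec-1] c.example.OrderService - order lookup failed
java.lang.NullPointerException: null
        at com.example.OrderService.find(OrderService.java:42)
        at com.example.OrderController.get(OrderController.java:18)
```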

Handling dynamic container scaling at the node level

Containers are routinely scaled out and in. Logs from newly added containers must be collected promptly or they will be lost, so the client must be able to discover collection sources dynamically, and deployment and configuration must be easy. Logtail addresses collection completeness from two dimensions:

  • Deployment

Logtail is deployed to Kubernetes nodes as a DaemonSet, which takes a single command and integrates easily with K8s application release processes (a simplified manifest sketch follows this list).

Once deployed on a node, the Logtail client talks to the Docker engine over its domain socket to handle dynamic container collection on that node. Incremental scanning detects container changes promptly, and a periodic full scan ensures that no container change event is missed; this dual guarantee lets the client discover candidate collection targets both promptly and completely.

  • Collection Configuration Management

Logtail was designed from the start around server-side, centralized collection configuration management, so that collection instructions are conveyed efficiently from server to client. The model can be abstracted as "machine group + collection configuration": Logtail instances in a machine group immediately obtain the collection configurations associated with that group and start the corresponding collection tasks.

For Kubernetes, Logtail uses a custom identifier to manage machines. A class of pods declares a fixed machine ID, Logtail uses this ID when reporting heartbeats to the server, and the machine group manages Logtail instances by this custom ID.

When Kubernetes nodes are scaled out, Logtail on the new node reports the pods' custom machine ID to the server, and the server pushes the collection configurations mounted on that machine group down to Logtail.

In contrast, the common practice of open source collection clients is to identify a client by machine IP or hostname. When containers scale, the IPs or hostnames in the machine group must be added or removed in time; otherwise data collection is incomplete, and a complex expansion process is needed as a safeguard.
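A simplified sketch of the DaemonSet deployment mentioned above. The image, paths, and volume list are illustrative placeholders rather than the official manifest; the two points that matter are one agent per node and access to the Docker domain socket plus the host's container log directory:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: logtail
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: logtail
  template:
    metadata:
      labels:
        app: logtail
    spec:
      containers:
      - name: logtail
        image: example/logtail:latest              # placeholder image
        volumeMounts:
        - name: docker-sock
          mountPath: /var/run/docker.sock          # talk to the Docker engine over its domain socket
        - name: container-logs
          mountPath: /var/lib/docker/containers    # read container stdout files and log directories
          readOnly: true
      volumes:
      - name: docker-sock
        hostPath:
          path: /var/run/docker.sock
      - name: container-logs
        hostPath:
          path: /var/lib/docker/containers
```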

Solving collection configuration management

Logtail provides two modes for managing collection configuration; you can choose either according to your preference:

  • CRD. Deeply integrated with the Kubernetes ecosystem: by watching CRD events, the client-side controller automatically creates the corresponding logstore, collection configuration, machine group, and other Log Service resources (sketched below).

  • Web console. Log formatting and parsing rules are configured quickly and visually, and a wizard associates the collection configuration with the machine group. The user only needs to specify the log directory as seen inside the container; when collection is enabled, Logtail automatically resolves it to the actual directory on the host.
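As an illustration of the CRD mode, a resource along the following lines declares "collect every container's stdout/stderr into a logstore". The field names follow the publicly documented AliyunLogConfig format, but treat this as a sketch rather than a reference:

```yaml
apiVersion: log.alibabacloud.com/v1alpha1
kind: AliyunLogConfig
metadata:
  name: all-stdout
spec:
  logstore: k8s-stdout              # target logstore, created automatically if absent
  logtailConfig:
    inputType: plugin               # plugin input: container stdout/stderr
    configName: all-stdout
    inputDetail:
      plugin:
        inputs:
        - type: service_docker_stdout
          detail:
            Stdout: true
            Stderr: true
```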

We define the path from a log source to its target logstore as a collection route. Implementing per-pod routing with traditional schemes is cumbersome: it has to be configured locally on the client, and each pod's container needs its own route written out, which creates a strong dependency on how containers are deployed and managed. Logtail instead expresses routing through Kubernetes environment variables.

A Kubernetes container's env consists of multiple key-value pairs that can be set when the container is deployed.

The Logtail collection configuration supports IncludeEnv and ExcludeEnv conditions to include or exclude collection sources based on these variables.

For example, a pod's service container is started with the log_type environment variable set, and the Logtail collection configuration declares IncludeEnv: log_type=nginx_access_log, so that logs from nginx-type pods are routed to a specific logstore (see the pod spec sketch below).
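A minimal sketch of the pod side of this routing; the image tag and names are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-frontend
spec:
  containers:
  - name: nginx
    image: nginx:1.25              # illustrative image tag
    env:
    - name: log_type
      value: nginx_access_log      # a collection configuration with
                                   # IncludeEnv: log_type=nginx_access_log
                                   # picks this container up and routes its
                                   # logs to the nginx access-log logstore
```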

For all data collected in Kubernetes, Logtail automatically tags each log with the pod, namespace, container, and image dimensions, which makes subsequent analysis easier.

Design of log context queries

A context query looks up the logs immediately before or after a given log on the original machine or in the original file, similar to grep -A and grep -B on Linux.

In DevOps and similar scenarios, locating a logic exception often needs this ordering information, and being able to view context makes troubleshooting much more efficient. In a distributed system, however, it is hard to preserve the original log order on both the source and the target:

  • At the collection client level, Kubernetes can produce a large volume of logs, so the collector must use multiple CPU cores to parse and pre-process them, and must use multi-threaded concurrency or single-threaded asynchronous callbacks to hide slow network I/O. As a result, log data does not reach the server in the order in which the events occurred on the machine.

  • At the server level, a horizontally scaled, load-balanced architecture spreads the logs of a single client machine across multiple storage nodes, and it is difficult to restore the original order from such scattered logs.

Traditional context query schemes re-sort the logs by arrival time at the server and by the log's own timestamp. In big data scenarios this runs into sorting performance problems and insufficient timestamp precision, and it cannot truly restore the real order of events.

Logtail, combined with Log Service's keyword query capability, solves this problem as follows:

When logs from a container file are collected and uploaded, a data package is composed of a batch of logs corresponding to a block of that file, for example 512 KB.

The logs inside a package are kept in the order of the source file, so the log following a given one is either in the same package or in the next package.

Logtail stamps each uploaded package with a unique log source ID (source_id) and a monotonically increasing package ID (package_id), and every log has an offset within its package.

Although packages may be stored out of order on the server, Log Service's index can seek to the package with a given source_id and package_id.

Given a specific log of container A (source_id: A, package_id: N, offset: M), first check whether that log is the last one in its package (with L log entries in the package, the last offset is L-1).

If M is less than L-1, the next log is at (source_id: A, package_id: N, offset: M+1); if the current log is the last one in its package, the next log is at (source_id: A, package_id: N+1, offset: 0). A small sketch of this rule follows.
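The same page-forward rule written as a small helper (a sketch; the previous-log case simply mirrors it):

```python
def next_position(source_id: str, package_id: int, offset: int, package_len: int):
    """Address of the log that follows (source_id, package_id, offset).

    package_len is L, the number of log entries in the current package.
    """
    if offset < package_len - 1:
        # the next log is still inside the same package
        return (source_id, package_id, offset + 1)
    # the current log closes its package, so move to the first log of the next package
    return (source_id, package_id + 1, 0)

# e.g. with L = 5 logs per package:
# next_position("A", 7, 3, 5) -> ("A", 7, 4)
# next_position("A", 7, 4, 5) -> ("A", 8, 0)
```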

In most scenarios, a package fetched by a random keyword query can serve up to L steps of context paging before another package is needed, which improves query performance and greatly reduces random I/O on the backend service.

Note: This article is republished from the official account "Cloud Habitat Community".