Preface: The Logging Operator series stalled for a long time last year, and I thought it would go no further. Recently, however, I picked it up again when I needed to handle the log observability part of my KubeGems project, which led to this third article in the series.

The Logging Operator is BanzaiCloud's open source log collection solution for cloud-native scenarios. In March 2020 it was refactored and released as v3. With the efficient Fluent Bit and the plugin-rich Fluentd, the Logging Operator adapts almost perfectly to logging in Kubernetes. The fact that Rancher adopted the Logging Operator as its unified logging solution starting with version 2.5 last year is enough to show that it is being accepted and integrated by Kubernetes-centric management platforms (including xiaobai's KubeGems).

As a continuation of the previous two articles, this one mainly shares xiaobai's recent cases and reflections on using the Logging Operator to solve user needs. I will therefore not spend much time describing its architecture and usage; if you are interested, you can read xiaobai's earlier articles.

About metrics

In the process of application containerization, because the container filesystem is ephemeral, developers constantly face the dilemma of writing log files to disk versus printing to stdout. When developers hand application log management over to the platform, the platform has to do far more than collect logs one-to-one per application. One day an SRE asked: "In Alibaba Cloud we can see the real-time rate of log collection; we want a customized quality metric like that." The question reminded me that back when we built private clouds, standing outside the platform we were always in a blind spot, missing information about what was happening inside the log collection pipeline. Fortunately, both Fluent Bit and Fluentd have their own Prometheus plugins to expose internal metrics, and with the Logging Operator the metric collection relies on the Prometheus Operator and the architecture is clear enough, so we have quite a few options to satisfy such requirements from development.

First, when defining the Logging resource, we enable Prometheus scraping for Fluent Bit and Fluentd:

spec:
  fluentdSpec:
    metrics:
      serviceMonitor: true
      serviceMonitorConfig:
        honorLabels: true    # enable honorLabels to keep the components' original labels from being overwritten
  fluentbitSpec:
    metrics:
      serviceMonitor: true

As you can see, the Logging Operator relies primarily on ServiceMonitor for service discovery on the collector side, which means the Prometheus Operator must be running in the cluster to provide that CRD. If that resource type does not exist in the cluster, Prometheus can fall back on its own service discovery mechanism to find and scrape the metrics.
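If you are not running the Prometheus Operator, a minimal sketch of such a scrape job might look like the one below. The job name and the prometheus.io/* annotation convention are assumptions for illustration; the Logging Operator does not set them up for you, so adjust them to however your Fluent Bit / Fluentd pods expose their metrics port.

# Hypothetical prometheus.yml scrape job using Kubernetes pod discovery
scrape_configs:
  - job_name: logging-fluent
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that explicitly opt in via annotation (illustrative convention)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Attach namespace/pod labels, similar to what honorLabels preserves via the ServiceMonitor
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod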

By default, these metrics only reflect the basic internal running state of Fluent Bit and Fluentd. If you want to go a step further and monitor the log rate itself, you need Fluentd. Back when Google GKE was still using Fluentd as its log collector, I stumbled across a Prometheus plugin configuration that intrigued me (and which I deliberately copied):

<filter **>
  @type prometheus
  <metric>
    type counter
    name logging_entry_count
    desc Total number of log entries generated by either application containers or system components
  </metric>
  <labels>
    container: $.kubernetes.container_name
    pod: $.kubernetes.pod_name
  </labels>
</filter>

This rule matches all logs entering Fluentd and passes them through the prometheus filter plugin for counting. The resulting metric is named logging_entry_count, with the Kubernetes metadata in each log record used as labels to distinguish containers.

Note that extracting container metadata normally requires Fluentd's kubernetes-metadata-filter plugin. In the Logging Operator, however, the Kubernetes metadata is already parsed in Fluent Bit, so there is no need to add an extra plugin on the Fluentd side.

Although Google GKE has since switched its log collector to Fluent Bit, this configuration is by no means "obsolete" in the Logging Operator. Building on it, we can bring the prometheus filter plugin into a tenant's log pipeline (Flow / ClusterFlow) to analyze log rates. The simplest practice looks like this:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: default
  namespace: demo
spec:
  filters:
  - prometheus:
      labels:
        container: $.kubernetes.container_name
        namespace: $.kubernetes.namespace_name
        node: $.kubernetes.host
        pod: $.kubernetes.pod_name
      metrics:
      - desc: Total number of log entries generated by either application containers
          or system components
        name: logging_entry_count
        type: counter
  globalOutputRefs:
  - containers-console
  match:
  - select:
      labels:
        what.you.want: collect

Once this metric is stored in Prometheus, you can use the following query to see the log collection rate of each application in the current cluster:

sum by (pod) (rate(logging_entry_count[1m]))

At this point, if your cloud platform has a multi-tenant, multi-environment architecture, you can even aggregate log rates at the tenant-environment or tenant level; a couple of example queries follow.
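For example, assuming for illustration that each tenant environment corresponds to a Kubernetes namespace, the aggregation is just another sum over the labels we attached in the Flow:

# Per-namespace (e.g. per tenant environment) log rate over the last 5 minutes
sum by (namespace) (rate(logging_entry_count[5m]))

# Top 10 noisiest containers in the cluster
topk(10, sum by (namespace, pod, container) (rate(logging_entry_count[1m])))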

The above only covers collecting and monitoring the overall log rate. If we need to count specific content within the logs, or gather byte-level statistics, we have to combine other plugins. The plugins currently supported by the Logging Operator are far from as rich as what Fluentd offers, but we can follow the official Logging Operator Developers Manual to write the plugins we need and integrate them into the operator.

The Logging Operator also ships with its own set of monitoring and alerting rules for the logging components, which can be enabled in the Logging CR:

spec:
  fluentbitSpec:
    metrics:
      prometheusRules: true
  fluentdSpec:
    metrics:
      prometheusRules: true

Here, PrometheusRule is again a resource managed by the Prometheus Operator. If that resource type does not exist in the cluster, you can configure the rules for Prometheus manually.
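If you do end up maintaining the rules yourself, a minimal sketch of a rule file loaded through Prometheus' rule_files could look like the following. The alert name and threshold are illustrative assumptions built on the logging_entry_count metric defined earlier, not the rules shipped by the Logging Operator:

groups:
- name: logging.rules
  rules:
  # Fire when a pod that was producing logs has emitted nothing for 10 minutes.
  # logging_entry_count is the counter defined in the Flow above.
  - alert: LogCollectionStalled
    expr: sum by (namespace, pod) (rate(logging_entry_count[10m])) == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "No logs collected from {{ $labels.namespace }}/{{ $labels.pod }} for 10 minutes"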

Going back to the original question: if you want to use the log collection rate as a quantitative metric for your applications, logging_entry_count does the job.

About sampling

In most cases, the logging architecture should not apply uncontrollable policies, such as sampling, that leave application logs incomplete, and obviously I do not recommend enabling this feature in an existing architecture either. Sometimes, however, certain wizards simply cannot rein in their program's "prehistoric power" and it spews logs like crazy. For such naughty applications the platform may have to sample, because keeping the whole logging pipeline available is the platform's first priority.

For log sampling, the Logging Operator uses throttle, a rate-limiting filter plugin. It applies a leaky-bucket algorithm to every log pipeline that passes through the filter and drops the logs that exceed the rate limit.

apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: default
  namespace: demo
spec:
  filters:
  - throttle:
      group_bucket_limit: 3000
      group_bucket_period_s: 10
      group_key: kubernetes.pod_name
  globalOutputRefs:
  - containers-console
  match:
  - select:
      labels:
        what.you.want: collect
  • group_key: the aggregation key for log sampling. We usually aggregate by pod name, but you can also point it at other fields in the kubernetes metadata.
  • group_bucket_period_s: the sampling time window, 60 seconds by default.
  • group_bucket_limit: the maximum number of log entries allowed in the bucket within the sampling window.

The log sampling rate is group_bucket_limit / group_bucket_period_s; once the log rate for a given group_key exceeds that value, the remaining logs in the window are discarded. With the configuration above, that works out to 3000 / 10 = 300 entries per second per pod.

Because throttle does not use a token-bucket algorithm, it has no burst capacity to absorb sudden spikes in log volume.

About logs written to disk

As mentioned earlier, the best logging practice for containerized applications is to write logs to stdout and stderr, but not all wizards follow this convention, and writing log files to disk is still the preferred option for many developers today. In theory a container's standard (error) output is itself just a redirection of the log stream into files under /var/log/containers, but in practice it is still subject to uncontrollable factors such as the runtime's logging configuration or disk issues.

There is no unified standard for handling logs written to files, but in practice there are only two approaches:

  • The sidecar approach

    In this solution, the log collector runs inside the pod alongside the application container and collects logs through a shared volume path. A common approach is to develop a dedicated Kubernetes controller and use a MutatingWebhook to inject the sidecar definition when the pod starts.

    The advantage of the sidecar approach is that each application's sidecar configuration is relatively independent. The disadvantages are that, besides consuming extra resources, collector upgrades have to follow the application's lifecycle, which is not elegant for continuous maintenance.

  • The node agent solution

    In this solution, the log collector runs on each node as a DaemonSet and collects logs centrally at the operating-system level. This usually requires the platform to define a path convention for developers and to mount a volume with a fixed hostPath into the container for the application to write its log files. In addition, we know that Kubernetes' built-in volume types, as well as third-party storage that follows the CSI standard, are mounted on the host under a path of the form /var/lib/kubelet/pods/<pod_id>/volumes/kubernetes.io~<volume_type>/<volume_name>/mount.

    A more elegant node-agent approach is therefore to grant the agent permission to call the Kubernetes API, so it can work out where each container's log files are mapped on the host.

    The Node Agent solution has the advantage of centralized configuration and management. Collectors and application containers are decoupled and do not affect each other. The disadvantage lies in the risk of insufficient throughput of the collector.

As you can see, neither approach has much to do with the Logging Operator itself. Admittedly, the community does not yet have a mature solution for this scenario, but the general idea is to turn log files back into standard (error) output streams so that the existing pipeline can handle them.

Let's use tail as an intuitive example to illustrate this approach:

...
  containers:
  - args:
    - -F
    - /path/to/your/log/file.log
    command:
    - tail
    image: busybox
    name: stream-log-file-[name]
    volumeMounts:
    - mountPath: /path/to/your/log
      name: mounted-log
...

tail is admittedly a crude method and does not handle problems such as log rotation, but it does offer the Logging Operator a way to deal with logs written to files. It looks similar to the sidecar approach, but the key difference is that it plugs straight into the Logging Operator's existing logging pipeline: once collected, the logs can still be processed in the Flow stage.
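As a rough sketch of that last point: the tail container's stdout is collected like any other container, so a regular Flow can pick it up and route it. The Flow name and the container name below (with "app" standing in for the [name] placeholder from the example above) are assumptions, and the match-by-container_names selector requires a Logging Operator version that supports it:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: file-log-stream
  namespace: demo
spec:
  # Only pick up the sidecar container that tails the on-disk log file
  match:
  - select:
      container_names:
      - stream-log-file-app
  globalOutputRefs:
  - containers-console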

Conclusion

From the perspective of automated operations, the Logging Operator effectively solves the problems of complex logging architecture and difficult application log collection in Kubernetes scenarios, even though its support for logs written to files is not yet comprehensive. As the number of users grows, better solutions to today's problems may emerge, but right now it is one of the best cloud-native logging architectures available.


Follow the public account [cloud native xiaobai] and reply “enter the group” to join the Loki learning group