Looking at Loki’s architecture documentation, the community describes Loki as a logging system that can run in multi-tenant mode. Reading more closely, the docs state that Loki only needs to meet two conditions to be multi-tenant:

  • auth_enabled: true is set in the configuration file
  • the request header carries the tenant information in X-Scope-OrgID
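
To make those two conditions concrete, here is a minimal sketch, assuming a Loki instance reachable at http://loki:3100 (the host name, label, and tenant name are illustrative):

# loki.yaml: turn on multi-tenant mode
auth_enabled: true

# push a log line as tenantA; Loki reads the tenant from the X-Scope-OrgID header
curl -s -X POST "http://loki:3100/loki/api/v1/push" \
  -H "Content-Type: application/json" \
  -H "X-Scope-OrgID: tenantA" \
  --data-raw '{"streams":[{"stream":{"app":"nginx"},"values":[["'"$(date +%s%N)"'","hello from tenantA"]]}]}'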

All of this seems to say, “Use me, it’s easy,” but we need to think about a lot more than that when we actually build a multi-tenant logging system in Kubernetes.

Typically, when dealing with multi-tenant logging, there are two log storage patterns that shape the architecture of the system.

1. Centralized log storage (hereinafter referred to as Solution A)

As in native Loki, logs enter a single cluster and, after a series of checks and indexing steps, are written centrally to the back-end storage.

2. Partitioned log storage (hereinafter referred to as Solution B)

The opposite of centralized storage: each tenant or project gets its own logging service and storage block for its logs.

Intuitively, the partitioned structure is more complex: besides developing your own controller to manage the lifecycle of the Loki services, you also need to provide the gateway with the correct routing policy. Whichever multi-tenant approach you choose, this article walks through how the different solutions are implemented along the overall flow of the logs.

Layer 1: Loki

Loki is the final log storage and query service. In multi-tenant mode, Loki leaves some configuration room for architects, whether it runs as a large cluster or as a small service. Especially for large clusters, it is important to ensure that log write and query resources are allocated and scheduled sensibly for each tenant.

In the native configuration, most tenant tuning is done in the following two configuration blocks:

  • query_frontend_config
  • limits_config

query_frontend_config

The query frontend is the entry point for log queries in Loki's distributed (cluster) mode; it is responsible for splitting and aggregating user log query requests. Clearly, the query frontend should prevent a single tenant's requests from monopolizing its processing resources.

The maximum number of outstanding requests each frontend will queue per tenant:

[max_outstanding_per_tenant: <int> | default = 100]
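
In a distributed deployment this parameter lives under the frontend block of the Loki configuration; a minimal sketch (the value is illustrative and should be sized against real query concurrency):

# loki.yaml (query-frontend component)
frontend:
  # cap on the queries a single tenant may have queued on each frontend
  max_outstanding_per_tenant: 1024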

limits_config

limits_config controls Loki's global rate-limiting parameters and per-tenant resource allocation. With Loki's runtime configuration mechanism enabled (the runtime_config file option), per-tenant limits can be reloaded dynamically on a schedule. This can be seen in the runtimeConfigValues structure in runtime_config.go:

type runtimeConfigValues struct {
	// per-tenant limit overrides, keyed by tenant ID; each entry has the same fields as limits_config
	TenantLimits map[string]*validation.Limits `yaml:"overrides"`

	Multi kv.MultiRuntimeConfig `yaml:"multi_kv_config"`
}

As you can see, TenantLimits reuses the same structure as limits_config, keyed by tenant ID, so the overrides file looks like this:

overrides:
  tenantA:
    ingestion_rate_mb: 10
    max_streams_per_user: 100000
    max_chunks_per_query: 100000
  tenantB:
    max_streams_per_user: 1000000
    max_chunks_per_query: 1000000
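
A minimal sketch of how the main Loki configuration might reference such an overrides file so that it is reloaded periodically (the file path and period are illustrative):

# loki.yaml
runtime_config:
  file: /etc/loki/runtime-config.yaml  # the file containing the overrides block above
  period: 10s                          # how often Loki re-reads the file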

When Solution A is chosen for the logging architecture, the per-tenant limits should be tuned flexibly according to each tenant's log volume. If Solution B is chosen, this logic is handled directly by the native limits_config, since each tenant owns a full set of Loki resources.

Layer 2: Log client

In a Kubernetes environment, the key is to make the logging client aware of the tenant that each collected container belongs to. This can be implemented either with a logging operator or by parsing Kubernetes metadata. Although the two approaches differ, the end goal is the same: after collecting the logs, the client adds the tenant header to the log push request. Below I describe the implementation logic of the Logging Operator and of FluentBit/FluentD.

Logging Operator

The Logging Operator is BanzaiCloud's open source log collection solution for cloud native scenarios. It controls log parsing and output through namespace-scoped Flow and Output CRD resources.

You can use the operator to control which containers' logs are collected for each tenant and how those logs flow. Taking Loki output as an example, the following resources would typically be created in the tenant's namespace.

  • Output.yaml, which carries the tenant-specific information when the resource is created
apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: loki-output
  namespace: <tenantA-namespace>
spec:
  loki:
    url: http://loki:3100
    username: <tenantA>
    password: <tenantA>
    tenant: <tenantA>
...
  • Flow.yaml, which associates the tenant's containers whose logs are to be collected with the output when the resource is created
apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: flow
  namespace: <tenantA-namespace>
spec:
  localOutputRefs:
    - loki-output
  match:
    - select:
        labels:
          app: nginx
  filters:
    - parser:
        remove_key_name_field: true
        reserve_data: true
        key_name: "log"

As you can see, managing multi-tenant logs through the operator is simple and elegant, and creating resources through CRDs is also friendly for developers integrating logging into their projects. This is my preferred logging client solution.

FluentBit/FluentD

FluentBit's and FluentD's Loki plugins also support multi-tenant configuration. The key for them is making the client aware of the tenant each log belongs to. Unlike the operator, which declares tenant information directly in CRDs, these clients have to capture tenant information proactively from Kubernetes metadata, where the tenant is declared as a label on a resource. However, the label path differs between the two clients. Their overall processing flow is as follows:

  • FluentD

FluentD's kubernetes_metadata filter can capture namespace labels, so I prefer to define the tenant information on the namespace.

apiVersion: v1
kind: Namespace
metadata:
  labels:
    tenant: <tenantA>
  name: <tenantA-namespace>

In this way, the Loki plugin can extract the tenant label from the namespace directly; the implementation logic looks like this:

<match loki.**>
  @type loki
  @id loki.output
  url "http://loki:3100"
  tenant ${$.kubernetes.namespace_labels.tenant}
  username <username>
  password <password>
  <label>
    tenant ${$.kubernetes.namespace_labels.tenant}
  </label>
</match>
  • FluentBit

FluentBit captures metadata at the pod level, so we need to define the tenant information in the workload's template.metadata.labels, as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
spec:
  template:
    metadata:
      labels:
        app: nginx
        tenant: <tenantA>

rewrite_tag is then used to extract the container's tenant information and split the log pipeline, and in the output stage each channel is shipped separately. The implementation logic looks like this:

[FILTER]
    Name       kubernetes
    Match      kube.*
    Kube_URL   https://kubernetes.default.svc:443
    Merge_Log  On

[FILTER]
    Name          rewrite_tag
    Match         kube.*
    Rule          $kubernetes['labels']['tenant'] ^(.*)$ tenant.$kubernetes['labels']['tenant'].$TAG false
    Emitter_Name  re_emitted

[OUTPUT]
    Name      grafana-loki
    Match     tenant.tenantA.*
    Url       http://loki:3100/api/prom/push
    TenantID  "tenantA"

[OUTPUT]
    Name      grafana-loki
    Match     tenant.tenantB.*
    Url       http://loki:3100/api/prom/push
    TenantID  "tenantB"

Whether FluentBit or FluentD is used for multi-tenant configuration, both impose certain requirements on labels and are not very flexible about where logs can be routed. On balance, FluentD is better suited to Solution A, while FluentBit is better suited to Solution B.

Layer 3: Log gateway

The log gateway is simply the gateway in front of the Loki service. For Solution A, the gateway in front of the large Loki cluster only needs to scale horizontally, and the tenant header is passed straight through to the Loki services behind it, so the scheme is simple enough to need no further explanation. Just note that the query path needs some tuning, such as the connection timeout between the gateway and its upstream and the maximum size of the gateway's response body.
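
As an illustration of the kind of tuning meant here, a hedged nginx sketch for the Solution A gateway (the upstream name and values are illustrative and should be sized against real traffic):

server {
    listen 3100;
    location / {
        proxy_pass            http://loki;                          # upstream pointing at the Loki cluster
        proxy_set_header      X-Scope-OrgID $http_x_scope_orgid;    # pass the tenant header through unchanged
        proxy_read_timeout    300s;                                  # long range queries need generous timeouts
        proxy_send_timeout    300s;
        client_max_body_size  50m;                                   # allow large push and query payloads
    }
}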

The log gateway discussed here mainly addresses the problem of routing logs to the right tenant in Solution B. As noted above, Solution B introduces a controller to manage the lifecycle of the per-tenant Loki instances. This, however, creates a new problem: each tenant's Loki service must be registered with the gateway, and the corresponding routing rules must be generated. This can be driven by the controller's CRD resources in the cluster, which serve as the source of the gateway's upstreams.
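
Purely as an illustration of that idea, the CRD that the controller reconciles could look something like the resource below; the group, kind, and fields are hypothetical, invented for this sketch rather than taken from an existing API:

apiVersion: logging.example.com/v1alpha1  # hypothetical group/version
kind: LokiTenant                          # hypothetical kind
metadata:
  name: tenanta
spec:
  tenantID: tenantA
  # the service the controller registers as this tenant's upstream in the gateway
  lokiService: loki-tenanta.logging.svc.cluster.local:3100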

When the gateway service processes the tenant header, the routing logic inspects the tenant carried in the X-Scope-OrgID header and forwards the request to the corresponding Loki service. Taking nginx as the gateway, the core logic looks like this:

upstream tenantA {
    server x.x.x.x:3100;
}

upstream tenantB {
    server y.y.y.y:3100;
}

server {
    location / {
        set $tenant $http_x_scope_orgid;
        proxy_pass http://$tenant;
        include proxy_params;
    }
}

Conclusion

This article has introduced two Loki-based logging architectures for multi-tenant mode: centralized log storage and partitioned log storage. Their characteristics are as follows:

| Solution | Loki architecture | Client architecture | Gateway architecture | Development difficulty | Operational difficulty | Degree of automation |
| --- | --- | --- | --- | --- | --- | --- |
| Centralized log storage | Clustered, complex | FluentD / FluentBit | Simple | Simple | Medium | Low |
| Partitioned log storage | Simple | Logging Operator | More complex | More complex (controller part) | Medium | High |

Teams with Kubernetes operator development experience can adopt the partitioned log storage solution; teams that lean more toward operations and maintenance can choose the centralized log storage solution.