Serverless and logging

Serverless is first of all a way of thinking: developers no longer have to worry about traditional infrastructure concerns such as server procurement, hardware operation and maintenance, network topology, and resource scaling, so DevOps staff can focus again on the innovation and stability of the business itself.

In most scenarios, using the Serverless architecture is a win-win choice:

  • Users pay only for the execution time or the resources they actually consume, so every expenditure corresponds to real usage and waste is avoided. At the same time, using Serverless services from cloud vendors lowers enterprise costs such as basic operation and maintenance staffing.
  • For Serverless providers, serving many users further improves cloud resource utilization and makes computing “greener”.

Serverless trends and challenges

Containers are a hot topic in the Serverless space and have grown rapidly since their inception, with good support for microservices and cost optimization. So much so that Kubernetes co-founder Brendan Burns talks about “the future being dominated by containers, and the containers themselves running on Serverless infrastructure” (see article).

As Serverless continues to grow in the IT market, logging, as its supporting backbone, is becoming a key part of the transition from traditional architectures to Serverless. Here’s a statistic from Serverless.com:

As the chart shows, the top three Serverless challenges are debugging, monitoring, and testing, which are classic DevOps issues. Serverless needs at the operations, maintenance, security, and management levels are just as urgent as those at the development level, and logs are an important means of addressing diagnosis, monitoring, security, and operations problems in software development.

In the Serverless trend, logging plays a role that cannot be ignored:

  • Service architectures are evolving toward Serverless, and centralizing logs to solve collection and storage is the first challenge.
  • Logging is becoming a strong dependency for many Serverless scenarios: meeting the various analysis requirements for diagnostics, monitoring, and DevOps is the second challenge.

Centralizing application logs under Serverless

Centralized log analysis, which covers log collection and storage, is the prerequisite for any log analysis. Traditional log centralization collects log files from scattered servers onto a central server. The requirements under Serverless are more complex: the data sources are more varied, and storage must be reliable and ready for subsequent big data analysis.

In the broad sense, Serverless can be divided into three categories:

  • Container as a Service

Common forms include containers deployed and managed with software such as Docker or Kubernetes, as well as hosted and fully managed container services provided by cloud vendors (such as Ali Cloud Container Service and Serverless Kubernetes).

Containers are flexible: they are created and destroyed dynamically while the system is running, which calls for dynamic log collection, and log collection solutions for traditional servers cannot be transplanted directly.

In the container scenario there are also many kinds of log sources, such as microservices and Kubernetes Pods, and logs of many types may be scattered across multiple hosts. Setting up log collection at the container level is cumbersome, error-prone, and costly to manage.

  • Backend as a Service

BaaS is Serverless in the broader sense: storage, hosted software, relational databases, NoSQL databases, message queues, API services, and more can all be Serverless. The earliest commercial Serverless platform can perhaps be traced back to Google App Engine, and today AWS, Ali Cloud, Azure, and other large vendors provide a rich set of such cloud services.

When using cloud services, users need audit logs, service access logs, and network traffic logs. The difficulty in collecting them is that users cannot install a collection client on the managed service to capture the data. Cloud service logs also need to be collected in real time: systems such as Server Load Balancer (SLB), Object Storage Service (OSS), Virtual Private Cloud (VPC), and Content Delivery Network (CDN) sit on the core links of a service architecture, and their real-time logs are needed for operations monitoring, risk control, and security scenarios. At the same time, service traffic peaks and valleys are reflected directly in cloud service log traffic; for example, when a website’s PV rises after 9 PM, SLB log traffic surges, which requires high-throughput collection and elastically scalable log storage.

  • Function as a Service

FaaS is Serverless at the computing level and is growing fast, with the market expected to reach $7.72 billion by 2021 according to a BusinessWire report. Typical examples are Ali Cloud Function Compute and AWS Lambda. FaaS lets users rent resources at a much finer granularity, paying per execution, and build applications simply by writing code and configuring a few parameters.

A FaaS function instance is started by a trigger or invoked directly by the user, and it is reclaimed as soon as execution completes, taking its logs with it. The first requirement is therefore a fast way to persist the logs generated during a function’s execution. The second is to analyze the collected run logs and function metrics for diagnosis and monitoring.
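Once such logs have been collected into a log store, a search-plus-SQL query (of the style discussed later in this article) can summarize invocations and errors per function. This is only a hedged sketch: the field names functionName, errorCode, and durationMs are hypothetical, not part of any fixed FaaS log schema.

    * | select functionName, count(*) as invocations, count_if(errorCode <> '0') as errors, avg(durationMs) as avg_duration_ms group by functionName order by errors desc limit 20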

One chart summarizes it as follows:

Diversified log analysis requirements

The main log analysis requirements can be laid out along two dimensions: log timeliness (from seconds to years) and the people who use the logs:

  • Developers focus on real-time log content, program exceptions, an interactive debugging experience, and quickly spotting exception patterns.
  • O&M personnel focus on system monitoring and need real-time alerting on system and service indicators.
  • Operations (business) staff use logs to shape business strategy and to build refined profiles of each user.
  • Managers look to log analysis for the revenue development and user growth figures they care about.

Core capabilities of Ali Cloud log service

Ali Cloud log service is a one-stop service for log data. It provides log collection, consumption, delivery, query, and analysis functions, helping users process and analyze massive volumes of logs. This section introduces the three core capabilities of the log service as background for the two Serverless logging practices that follow.

Unified log collection and management platform

The log service provides a fairly complete set of collection tools and solutions covering the following scenarios: traditional servers (logs, metrics), applications (custom logs), embedded IoT and mobile devices (WebTracking, metrics, logs, etc.), and, the subject of this article, Serverless applications (logs and metrics from cloud services, containers, and FaaS). It helps users quickly ingest all kinds of data into Ali Cloud log service storage.

At the same time, the log service is a multi-tenant platform that supports flexible scaling of write and storage capabilities through Shard horizontal scaling. All kinds of data can be stored in different log libraries (LogStores) for subsequent joint analysis and classified management.

Real-time, interactive computation and analysis

After logs have been collected into a centralized storage platform, the traditional approach is ETL: connect the logs to various downstream computing tools to meet different requirements, for example:

  • Stream-process the data, store the results in a database, and have BI reports load that database to display business indicators.
  • Build a full-text index over the data in small batches, combining forward and inverted storage to support keyword queries.
  • Compute intermediate results from the raw data, apply alarm rules to them, and trigger an alarm notification when a threshold is reached.

The problem is that the service architecture becomes complicated: M downstream systems must be connected to the storage layer to cover N scenarios, and specialists in each of those systems are needed to build the integrations and keep them stable, which raises both the labor cost and the time-to-business cost for the enterprise.

The log service instead indexes incoming data in real time and provides LogSearch (full-text retrieval) and LogAnalytic (SQL analysis) on top of it. A single user-facing interface, the query statement, serves the different requirements: SQL queries written for different purposes are submitted to the computing engine, and the results are rendered through the matching visualization or notification channel to deliver report analysis, real-time alerting, or keyword queries. This covers the main requirements of logging scenarios and has two characteristics (a query sketch follows the list below):

  • Dynamic: dynamic SQL (user needs change over time) runs against dynamic data (logs arrive in real time) and returns results to the user in seconds.
  • Interactive: a complex diagnosis or analysis usually takes several rounds of interaction. For example, after receiving an alarm on a service log, a developer can drill down from the alarm content into the corresponding metric dashboard to see the overall trend, then click an abnormal metric on the dashboard and drill down again into the related log context to inspect the details.
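As a hedged illustration of the query statement (the field names status and request_id are assumptions, not a fixed schema), a single statement combines a search clause and an SQL clause, and the same LogStore answers both keyword queries and aggregations:

    -- LogSearch: keyword query over the full-text index, e.g. errors for one (hypothetical) request id
    error and request_id: 7f3a*
    -- LogAnalytic: per-minute error trend, assuming a numeric status field is indexed for analysis
    status >= 500 | select date_format(__time__, '%m-%d %H:%i') as t, count(*) as error_count group by t order by t asc limit 1440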

An open, connected downstream log ecosystem

The log service is an open system, and the data it stores can easily be integrated with other big data systems. The diagram below maps the downstream ecosystem of the log service, showing the consumer systems supported in four major scenarios: online analysis, stream computing, warehouse archiving, and visualization.

In summary, the downstream ecosystem of the log service falls into three tiers:

  • Integration with Ali Cloud products

    • Storage: ship data to OSS, MaxCompute, and TableStore
    • Computing: Function Compute triggers; integration with computing systems such as Blink and EMR
    • Analysis and visualization: system monitoring and data dashboards through ARMS and DataV
  • Integration with the open-source community

    • Computing: connect to Flink, Storm, Spark, Hive, and other big data systems
    • Analysis and visualization: data integration with Jaeger, JDBC, Grafana, and others
  • Connecting with third-party log vendors (in progress)

Practice 1: Analyzing load balancing access logs online

Aliyun SLB (Server Load Balancer) is a load balancing service that distributes traffic across multiple cloud servers. By distributing traffic it expands the external service capability of application systems, and by eliminating single points of failure it improves their availability. SLB + ECS is a classic service architecture on Aliyun. This section describes online analysis of SLB access logs.

Enabling the SLB access log function

Access logs record complete user requests and are essential for diagnosing service exceptions. Layer-7 access logs recorded on SLB have two advantages over the Nginx access logs recorded on back-end ECS instances:

  • More complete: they include fields such as the VIP, upstream_latency, and so on
  • Broader coverage: they also capture the small percentage of requests that never reach the back-end servers

SLB layer-7 access log collection can be enabled with one click. At its core, request logs are recorded on the Tengine proxy and forwarded to a LogStore for storage according to user-defined rules.

End-to-end processing delay is on the order of seconds, a clear advantage over the “T+1” or hourly delays of typical log delivery in the industry. In addition, the LogStore supports Shard expansion and automatic splitting, dynamically growing write and storage capacity as write traffic changes and helping users ride out multi-fold traffic increases during peak hours.

Service alarms based on SLB access logs

Several access log fields related to request latency and status:

| Field | Description |
| --- | --- |
| client_ip | Client IP address of the request |
| vip_addr | VIP address |
| upstream_addr | IP address and port of the back-end server |
| request_time | Time from when the proxy receives the first request packet to when it returns the response, in seconds |
| upstream_response_time | Time from when the SLB back end establishes the connection to when the connection is closed after data is received, in seconds |
| status | Status code of the proxy's response packet |
| upstream_status | Response status code the proxy receives from the back-end server |

Configure an alarm task on the log service and use the following SQL to compute the latency trend for each ECS real server over the last 15 minutes:

    * | select date_format(__time__, '%m-%d %H:%i:%s') as t, upstream_addr, avg(upstream_response_time) * 1000 as avg_upstream_response_ms group by t, upstream_addr order by t asc limit 1000

When the average per-second latency exceeds the 100 ms threshold, an SMS notification is sent.
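As a sketch of the alarm condition (the alert configuration itself is set in the console and not shown here), the query can also be restricted to the back-end servers that are currently over the threshold:

    * | select upstream_addr, avg(upstream_response_time) * 1000 as avg_upstream_response_ms group by upstream_addr having avg(upstream_response_time) * 1000 > 100 order by avg_upstream_response_ms desc limit 100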

Combined with drill-down for interactive analysis

The following short video demonstrates this:

  • In the SLB operations report, drill down from error code 499 into the raw logs to view the details, then use SQL to find the top 10 client IP sources among the 499 responses (a query sketch follows this list)
  • From the SLB operations report, select an SLB instance ID and drill into the SLB access center to view the detailed report statistics for that instance
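A hedged sketch of that top-10 query, using the access log fields listed above:

    status: 499 | select client_ip, count(*) as pv group by client_ip order by pv desc limit 10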

Drill-down makes it easy to jump between reports and log queries, mirroring the diagnostic thought process of DevOps personnel.

Practice 2: Kubernetes log collection and diagnostic analysis

Kubernetes log collection difficulties

At present, all major cloud providers offer hosted Kubernetes services, including Serverless Kubernetes. Compared with classic Kubernetes, users no longer have to care about clusters and machines: they only need to declare the container image, CPU, memory, and external service mode to start an application.

Moving from left to right, log collection becomes more complicated:

  • A single Kubernetes node may run an order of magnitude more pods than a traditional host, and each pod may need log collection or metric monitoring, so the log volume on a single node is much larger.
  • More kinds of pods may run on a single Kubernetes node, and the diversity of log sources makes collecting, tagging, and managing logs all the more pressing.

Kubernetes’ most native log solution is the kubectl logs command-line tool. It is easy to use, but data is lost when a container crashes or is released, and there is no way to bring in cloud or open-source analysis tools. Kubernetes therefore officially proposes a sidecar-based collection scheme: each pod runs a business container and a collection-client container, and the collection client is responsible for collecting and reporting the business container’s data within the pod.

The Sidecar solution essentially decouples the business system from the logging system. It addresses basic requirements such as persistent log storage, but it still needs improvement in two areas:

  • If N pods run on a Kubernetes node, N log-client processes run alongside them, wasting CPU, memory, ports, and other resources.
  • Managing the collection configuration (log directories, collection rules, storage targets, and so on) for each Kubernetes pod is cumbersome and hard to maintain.

Use Logtail to collect Kubernetes logs

The log service uses Logtail (its self-developed data collection client) to complement the official Fluentd-agent-based Kubernetes sidecar solution and to address some of the detailed log-processing pain points in the Kubernetes scenario.

Logtail collection is a Kubernetes node-level solution: a Logtail DaemonSet is deployed on each node to collect the logs generated by all containers on that node. Once the AliyunLogController Helm chart is deployed, changes to containers and directories on the node are monitored, and users can declare what to collect through CRD configuration; the controller applies for storage resources from the log service and creates the collection configurations.

  • Dynamic Collection Support

    • Logtail talks to the Docker engine over its domain socket to handle dynamic container collection on the node. Incremental scanning detects container changes promptly, while periodic full scanning guarantees that no container change event is lost; this double guarantee lets the client detect candidate monitoring targets both promptly and completely.
    • For the Kubernetes scenario, Logtail manages machines with a custom identity. A class of pods can declare a fixed custom machine ID, which Logtail uses when reporting heartbeats to the server, and the server manages this group of Logtail instances through a machine group keyed on that custom ID. When the Kubernetes cluster scales out, the new pods’ Logtail instances report the same custom machine ID, and the server pushes the collection configurations mounted on that machine group down to them. In contrast, open-source collection clients commonly identify themselves by machine IP or hostname, so when containers scale, the IPs or hostnames in the machine group must be added or removed in time; otherwise data collection is incomplete, and a complex scaling procedure is needed to keep things consistent.
  • Data management support

    • The route from a log source to its destination LogStore is called a collection route. Implementing per-pod collection routes with traditional schemes is very troublesome: the route has to be configured locally on the client and written into every pod’s container, which couples it tightly to container deployment and management. Logtail instead relies on Kubernetes environment variables (key-value pairs set when a container is deployed): its collection configuration supports IncludeEnv and ExcludeEnv to include or exclude collection sources. For example, a business container can set a log_type environment variable at startup, and a Logtail collection configuration that defines IncludeEnv: log_type=nginx_access_log will collect the logs of nginx-type pods into a specific LogStore.
    • During data collection, Logtail records the Kubernetes Namespace, Pod, and Container along with the location of the original file. Combined with the log service’s query capability, this effectively implements log context queries on a distributed storage system, similar to grep -A/-B and tail -f on Linux. For details, see “Design of log context queries”. These recorded fields can also be used directly in analysis queries (a query sketch follows this list).
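Since each collected log line carries these Kubernetes fields as tags, a hedged sketch (the tag names below follow the common Logtail convention and may differ by version) for counting error logs per namespace and pod could look like:

    error | select "__tag__:_namespace_" as ns, "__tag__:_pod_name_" as pod, count(*) as error_count group by ns, pod order by error_count desc limit 20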

Online diagnosis of Kubernetes logs

Traditionally, Linux developers use grep and tail to check logs in real time. Building on the real-time collection and query capabilities of the log service, the following short video demonstrates how to diagnose Kubernetes logs online:

  • In log analysis mode, use the Namespace, Pod, or Container as the context ID of the associated logs and click LiveTail to view the original log context
  • In LiveTail mode, logs are loaded in real time; keywords can be entered to filter logs, with matching content highlighted in red, and the histogram at the top of the page shows the log generation rate in each fine-grained time window
  • When a noteworthy error shows up in LiveTail mode, click it to drill back down into the log analysis page

Kubernetes log data storage

Target scenario: we want extremely low log storage cost, separate handling of hot and cold data, and support for ad-hoc analysis, for example a full-data computation run once a day.

In this case, a typical setup is: the log service keeps the last day of data as hot data for fault diagnosis and service alerting, and one-click delivery to OSS is enabled so that the log service ships a copy of the data to OSS at minute granularity.

  • Fully managed delivery to OSS

The log service triggers delivery tasks according to the time or size policy set by the user; the tasks run in Serverless mode as a fully managed service. Files delivered to OSS can use row formats (JSON, CSV) or a columnar format (Parquet).

  • Storage on OSS

Files delivered to OSS can use Parquet’s built-in columnar compression or the general-purpose Snappy compression, and can be combined with OSS infrequent-access storage classes to optimize storage cost.

  • Separation of storage and computing

Data on OSS can then be processed with big data engines such as Ali Cloud Data Lake Analytics, HybridDB, and EMR. For example, the directory structure of the files delivered to OSS, organized by bucket, prefix, and datetime, can be mapped onto Hive-style projects, tables, and partitions in the data warehouse, which improves data scanning and filtering performance (see the sketch below).
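A hedged sketch of such a mapping, with the bucket name, prefix, and column names all hypothetical and the exact DDL depending on the engine (the Hive-compatible form is shown):

    -- map the OSS delivery directory (partitioned by date and hour) onto an external table
    CREATE EXTERNAL TABLE slb_access_log (
      client_ip STRING,
      upstream_addr STRING,
      request_time DOUBLE,
      upstream_response_time DOUBLE,
      status INT
    )
    PARTITIONED BY (dt STRING, hour STRING)
    STORED AS PARQUET
    LOCATION 'oss://my-bucket/slb_access_log/';

    -- a daily full-data job then only scans the partitions it needs
    SELECT status, count(*) AS pv FROM slb_access_log WHERE dt = '2019-01-01' GROUP BY status;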

Conclusion

This article has introduced the development of Serverless and the role logging plays in this wave, and offered two practical references for building a Serverless application log-processing architecture on top of the real-time collection, reliable storage, and interactive analysis capabilities of Ali Cloud Log Service.