Background

Building on the earlier article on how Ant Financial designs its component system for hundred-million-level concurrency scenarios, this article explains how Alipay's mobile component system is built and the thinking behind it. Against the backdrop of the server-side component system, we focus on the evolution of "automated mobile logging and analysis" inside the Alipay App.

Alipay's mobile technical service framework

This is the technical architecture diagram of Alipay's wireless foundation team for the mobile side. Other businesses in the Ant Financial ecosystem, such as Koubei, MYbank, Alipay HK, and Tianhong Fund, also rely on the mobile development platform mPaaS for code development, packaging, grayscale release, going live, bug fixing, and operational analysis. mPaaS is thus derived from the core technical architecture of the Alipay client, and it provides capability support at each stage of an App's life cycle. Below we focus on the architectural evolution and design choices behind the log diagnostics and mobile analytics capabilities.

Alipay's mobile technical service framework: data analysis architecture

The technical architecture of the data analysis capability is shown in the figure. "Data synchronization", the "tracking SDK", and the "log gateway" are capabilities specific to the mobile side; the remaining components are infrastructure that any data analysis platform must have.

1. Collect logs

Next, let's look at Alipay's mobile log collection framework. The first piece is the log SDK, which exposes a tracking interface to the business layer, much like Logger.info in Java: the business layer only passes in the information it wants to record. The log SDK then gathers system-level context, such as the device model, OS version, App version, screen resolution, user ID (if the user is logged in), device ID, the previous page, and the current page, assembles this context together with the business payload into a single tracking record, and writes it to the device's local storage. Yes, it is written to local storage, not reported to the server right away.

The log SDK negotiates with the log gateway at appropriate times to determine which log types, and which log levels, may be reported, and if they may, at what frequency and under what network conditions. Through this interplay between the log SDK and the log gateway, we can degrade logging by policy. Policy-based log degradation is especially important for Alipay: at its current scale, the daily reporting volume is roughly 300,000 logs per second, and during major promotions the log volume can reach dozens of times the daily level. If we applied no degradation strategy during a promotion, logs would saturate our bandwidth and Alipay's normal business functions would become unavailable.

From this we can conclude that in high-concurrency scenarios we should keep only key logs uploading and degrade the rest. Even in everyday operation, some logs are reported only over Wi-Fi, to avoid complaints about mobile data usage.
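A reporting-policy check of this kind might look like the following sketch. The policy fields and thresholds are illustrative assumptions, not Alipay's actual schema:

```python
# Hypothetical reporting policy, as delivered by the log gateway.
POLICY = {
    "min_level": 2,                      # 0=debug, 1=info, 2=warn, 3=error
    "wifi_only_types": {"behavior", "performance"},
    "never_degrade": {"exception"},      # crash/jank logs are always reported
    "sample_rate": 0.1,                  # during a promotion, report 10%
}

def should_report(log_type, level, on_wifi, sample):
    """Decide whether a locally stored log may be uploaded right now."""
    if log_type in POLICY["never_degrade"]:
        return True
    if level < POLICY["min_level"]:
        return False
    if log_type in POLICY["wifi_only_types"] and not on_wifi:
        return False
    # `sample` is a uniform random draw in [0, 1) made by the client.
    return sample < POLICY["sample_rate"]
```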

Besides controlling log reporting, the log gateway also receives logs from clients. After receiving them, it validates the content and discards invalid logs. For valid tracking records, it reverse-geocodes the client's public IP address to city-level location information, attaches that location to the record, and then stores the record on its own disks.
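The gateway's receive path can be sketched as follows. The IP-to-city lookup table is purely illustrative (using TEST-NET addresses); a real deployment would use a proper geo-IP database:

```python
# Illustrative prefix-based lookup; a real gateway would use a geo-IP database.
IP_TO_CITY = {"203.0.113.": "Hangzhou", "198.51.100.": "Shanghai"}

def ingest(record, client_ip, storage):
    """Validate a tracking record, attach city-level location, and store it."""
    required = {"event_id", "timestamp", "device_id"}
    if not required.issubset(record):
        return False                          # invalid logs are discarded
    prefix = client_ip.rsplit(".", 1)[0] + "."
    record["city"] = IP_TO_CITY.get(prefix, "unknown")
    storage.append(record)                    # persisted on the gateway's disks
    return True
```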

2. Categories of tracking points

After years of practice, Alipay has divided tracking logs into four categories.

(1) Behavioral tracking: used to monitor business behavior; the tracking logs passed in by the business layer belong to this category. Behavioral tracking is "manual tracking", written by business-layer developers themselves. However, not every business behavior must be instrumented by hand: for events with very broad applicability, such as app-activation and login events, Alipay records them in the framework layer. Behavioral tracking can therefore also be called "semi-automatic tracking".

(2) Automated tracking: "fully automatic tracking" that records generic page-level and component-level behavior, such as page open, page close, time spent on a page, and component clicks.

(3) Performance tracking: "fully automatic tracking" that records the App's power consumption, data usage, memory usage, startup speed, and so on.

(4) Exception tracking: "fully automatic tracking" and, strictly speaking, a kind of performance tracking. It records the most critical performance indicators that directly affect users, such as crashes, jank, and freezes. This is the one category that must never be degraded, even during major promotions!

The code example in the figure shows a behavioral tracking record; as you can see, it is really just a line of CSV text. In it we can see the log gateway's address and location information, as well as the client device information added by the log SDK.

3. Log processing model

The following is a general overview of Alipay's internal log processing pipeline:

(1) Log splitting

We've already seen that a tracking record is really just a line of CSV text. Log splitting converts that CSV text into key-value pairs, essentially a HashMap in memory. The simplest approach is to split on commas, though there are many other ways.
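As a sketch, splitting one record might look like this. The field order is an illustrative assumption; the real schema is internal to Alipay:

```python
# Hypothetical field order for a tracking record.
FIELDS = ["event_id", "timestamp", "device_id", "os_version", "current_page"]

def split_log(csv_line):
    """Convert one CSV tracking record into an in-memory key-value map."""
    return dict(zip(FIELDS, csv_line.split(",")))
```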

(2) Log transformation

The raw contents of a tracking record are not directly usable for analysis. For example, client startup time is reported at millisecond granularity, but in practice we only need to know which range it falls into, so before processing we usually map the exact startup time to a range code. Besides this kind of mapping, there are also filtering transformations such as whitelists and blacklists.
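The startup-time mapping can be sketched as a simple bucketing function. The boundaries and labels below are illustrative, not Alipay's actual ranges:

```python
# Illustrative range boundaries in milliseconds.
BUCKETS = [(500, "fast"), (1500, "normal"), (3000, "slow")]

def bucket_startup_time(ms):
    """Map an exact startup time to a coarse range code for analysis."""
    for upper, label in BUCKETS:
        if ms < upper:
            return label
    return "very_slow"
```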

(3) Dimension table projection

Since the client cannot capture every dimension the business side wants to analyze, such as the user's gender or occupation, we apply dimension table projection before the actual analysis: the user ID in the tracking record is joined against a dimension table to attach business-level attributes for subsequent analysis.
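A sketch of such a projection, with a hypothetical in-memory dimension table (in production this would be a lookup against a real dimension store):

```python
# Hypothetical dimension table mapping user IDs to business-level attributes.
USER_DIM = {
    "u1": {"gender": "F", "occupation": "teacher"},
    "u2": {"gender": "M", "occupation": "engineer"},
}

def project_dimensions(record):
    """Join a tracking record with the user dimension table."""
    attrs = USER_DIM.get(record.get("user_id"), {})
    return {**record, **attrs}
```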

(4) UID deduplication

Metrics here come in two kinds: PV (page view) metrics and UV (unique visitor) metrics.

Before a UV metric is computed, a UID deduplication step is performed: for each UID, we check whether that ID has already been seen within the current time window, and if it has, the record is discarded. UV metrics are always tied to a time window, such as daily UV, hourly UV, minute-level UV, or monthly UV.
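Windowed UID deduplication can be sketched like this (a real pipeline would use a distributed state store rather than an in-memory set):

```python
def dedupe_uids(events):
    """Keep one event per (window, uid); later duplicates are discarded.

    `events` are (window, uid) pairs, e.g. ("2021-05-01", "u1") for daily UV.
    """
    seen = set()
    kept = []
    for window, uid in events:
        if (window, uid) not in seen:
            seen.add((window, uid))
            kept.append((window, uid))
    return kept
```

After deduplication, the UV for a window is simply the count of kept records in that window.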

(5) Metric aggregation

After all the steps above, it is finally time to compute the metrics. Aggregation methods include sum, average, maximum, minimum, and the 95th percentile.

(6) Writing out results

That is, the computed metric results are written out to various stores, or delivered to mPaaS clients through interfaces. There are currently three main computing modes: real-time computing, offline computing, and on-demand computing. Let's look at each of the three.

  • Real-time computing

Real-time computing mode: the model starts computing as soon as a log is received, and results are produced within N minutes (N ≤ 2), with very low latency.

Recommended technology choices for real-time computing: 1) Flink, 2) Spark, 3) Storm (JStorm in Ali's fork), 4) Akka. Akka suits real-time monitoring and alerting with lighter business logic; for key business dashboards and log translation and replay, Flink is the better choice.

The summary is as follows:

The advantages of real-time computing are fast output, low resource consumption, and medium flexibility; the disadvantages are a steep learning curve, high maintenance cost, and high configuration complexity.

  • Offline computing

Offline computing mode: after receiving logs, the system stores them without processing. Once logs have accumulated for a period of time, the model processes multiple days' or months' worth of log data in one batch.

Recommended technology choices for offline computing: 1) Flink, 2) Spark, 3) Hive, 4) Hadoop. Notice that Flink and Spark also appear in the real-time recommendations: the two technologies started from different points but have converged on the same destination.

Flink began as a real-time engine and later added batch offline computing; Spark went the opposite way. For now, we play to each stack's strengths and use Spark, rather than Flink, for offline computing. A word of history here: there used to be a classic architecture called the "Lambda" model, in which results computed in real time were recomputed offline. This was because early real-time computation could lose data and produce inaccurate results, so real-time results were used only to observe trends, and the next day's offline results guaranteed that the business side saw accurate data. Since Google's Dataflow paper, however, essentially all real-time computing engines on the market provide "exactly once" semantics; in other words, real-time computing no longer suffers from data loss or inaccurate results. The "Lambda" shape still exists today, in that data flows to both the real-time and offline engines, but offline computing is no longer a corrective supplement to real-time computing; instead, it exploits its throughput to compute multi-day and multi-month metrics.

The summary is as follows:

The advantages of offline computing are high computing performance and low learning difficulty; the disadvantages are heavy resource consumption, high latency, and low flexibility.

  • On-demand computing

On-demand computing mode: after receiving logs, only simple pre-processing is performed, such as log splitting, and the logs are then stored directly. When the interface needs to display results, the data is processed on the spot according to user-defined filtering and aggregation rules.

Recommended technology choices for on-demand computing: ClickHouse (from Yandex), Apache Druid (from MetaMarkets), and Pinot (from LinkedIn). Internally, Alipay applies the on-demand computing model to scenarios such as drill-down analysis and funnel analysis, and sometimes uses it directly as an OLAP database.

The summary is as follows:

The advantages of on-demand computing are extremely high flexibility, UV aggregation across arbitrary dimensions, unlimited drill-down capability, acceptable query latency (within 15 s), and low learning cost; the disadvantages are enormous resource consumption, medium latency, high maintenance cost, and a complex architecture, and it cannot be used for real-time monitoring and alerting.

This is the technical architecture diagram of an on-demand computing framework inside Alipay. As the figure shows, the architecture includes real-time nodes that receive data writes, deep storage for historical data (HDFS, AFS, and OSS), and historical nodes that serve query and analysis over historical data (data at least one day old). The framework fully supports the MySQL protocol, so users can operate it directly with a MySQL client. Another important feature is that it can join and analyze arbitrary combinations of external data.

  • Log processing models: summary

Let's summarize the three computing modes:

Real-time computing model: after data ingestion, computation is performed immediately according to predefined rules, and the required results are produced within N minutes.

Application scenarios: real-time monitoring and alerting, link tracing
Advantages: low resource consumption, fast output
Disadvantages: high configuration complexity, low flexibility

Offline computing model: after ingestion, data is stored for N hours or days and then batch-processed according to predefined rules.

Application scenarios: user segmentation, data marts
Advantages: high performance, low learning cost
Disadvantages: high latency, low flexibility

On-demand computing model: after data ingestion, only simple pre-processing is done and the data is stored immediately; metrics are computed in real time, according to the query's requirements, at query time.

Application scenarios: drill-down analysis, funnel analysis
Advantages: high flexibility, low learning cost
Disadvantages: high resource consumption, high latency

We hope you can choose an appropriate technology stack based on these recommendations.

  • Dynamic tracking

Having covered our current tracking categories and computing models, let's talk about the next-generation tracking framework we currently use internally: dynamic tracking.

We first worked through the ideas behind the four problems above, and for each problem I gave a corresponding approach and solution. Are these the best, or the only, answers to those problems? Obviously not, because every one of these solutions requires development work. On the client, new development means releasing a new version, and newly added tracking points may also crowd the inside of the App.

What is dynamic tracking? It has three core concepts: 1) the tracking-point set, 2) dynamically configured reporting rules, and 3) dynamically configured metric computation.

With these three capabilities and configurations, we can configure monitoring for a business link from the existing set of tracking points and define computation rules for specific metrics over them, so new metrics can be brought under monitoring quickly without additional development. Better still, tracking points become a reusable shared resource that everyone can draw on, rather than something we keep adding, which would bloat the code.
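The idea can be sketched as configuration plus a small evaluator: which existing tracking points to select, how to filter them, and how to aggregate, all delivered as data rather than code. The rule schema below is a hypothetical illustration, not mPaaS's real configuration format:

```python
# Hypothetical dynamic-metric rule, delivered from the server as configuration.
RULE = {
    "select_events": {"page_open", "pay_click"},   # existing tracking points
    "filter": {"os": "Android"},
    "group_by": "current_page",
    "aggregate": "count",
}

def evaluate(rule, records):
    """Evaluate a dynamically configured metric over collected records."""
    result = {}
    for r in records:
        if r["event_id"] not in rule["select_events"]:
            continue
        if any(r.get(k) != v for k, v in rule["filter"].items()):
            continue
        key = r.get(rule["group_by"])
        result[key] = result.get(key, 0) + 1
    return result
```

A new metric then needs only a new rule, not a new client release.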

Let's take a look at some UI screenshots of the dynamic tracking framework.

Next, let's look at a tracking-related application service in Alipay: real-time log pulling. Its main technical components are MPS (message push), MSS (data synchronization), and the log gateway in mPaaS. Ant-system Apps keep long-lived connections through MPS (on Android) and MSS (on iOS). When logs need to be pulled in real time, the operator issues a command from the mPaaS console over these two channels, and the client then reports all detailed, encrypted logs to the log gateway.

This is a screenshot of the real-time log pulling interface.

Drawing on years of experience, mPaaS provides dedicated dashboards for the most important performance indicators: crashes, jank, and deadlocks. As shown in the figure, the upper part is the number of crashes per minute; below it, crashes are grouped by a crash-classification algorithm, and we can also see the specific device distribution within each group.

Finally, let's take a logical look at the mPaaS server architecture. mPaaS consists of five core components:

  1. MGS (gateway service): forwards client requests to the business servers.

  2. MDS (delivery service): provides clients with release capabilities for multiple resource types under multiple grayscale policies.

  3. MPS & MSS (message push and data synchronization services): provide data delivery over long-lived connections.

  4. MAS (analytics service): the focus of this article, providing analysis capabilities based on tracking logs.

If you are interested in the mPaaS mobile analytics service, please click through to the documentation for more details.

Further reading

Opening: An Overview of the Core Component System of Ant Financial mPaaS Services

Summary of Ant Financial mPaaS Server Core Component System: Mobile API Gateway MGS

Core Components of Ant Financial mPaaS Server: Analysis of Mobile End-to-end Network Access Architecture under Hundred-million-level Concurrency

Core Components of mPaaS Server: Architecture and Process Design of MPS for Message Push

MPaaS Core Components: How does Alipay build public Opinion Analysis System for Mobile Terminal Products?

mPaaS Server Core Components: Mobile Analysis Service MAS Architecture Analysis

Component System Design of Ant Financial facing Hundred-million-level Concurrency Scenario
