The Front-End Early Chat Conference, jointly organized by Front-End Early Chat and Juejin.

Session 7 – Putting micro-frontends into practice in admin-console and other scenarios, and the design of Qiankun. May 30, 7 speakers, registration: huodongxing.com/go/tl7

Session 8 – How to interview into a big tech company, shared by candidates and interviewers from Alibaba and other large companies (the author of this article interviewed into 1688 and is on the speaker lineup). May 31, 15 speakers, registration: huodongxing.com/go/tl8

“Codingdreamer” is just the person you want to meet

Last year there was a text version of this share, written when the system was not yet fully built; the two articles are best read together: "Technical exploration: 60 days of rapid in-house development – a front-end tracking and monitoring system", by Big Prodigal Son.

This is the condensed write-up of the share by Jimmy, speaker for Session 5 – Building a Front-End Monitoring System (please see the video for the complete version):

It has been about five years since Song Xiaocai's first app was launched, and our client applications have gradually grown from a single RN app into a mix of RN apps, PC/H5 pages, mini programs, and other types. We started building our own tracking and monitoring system at the beginning of 2019; it took shape and went online at the end of that year and has been running ever since. Here I will introduce how we built the current tracking and monitoring system, covering the following aspects:

Part 1: why we decided to build this system ourselves, and some of the thinking behind its design and development;
Part 2: a brief introduction to the SDK implementation;
Part 3: how the reported logs are processed, and the pitfalls we ran into;
Part 4: a brief introduction to and demo of the monitoring dashboard;
Part 5: the design of our alarm controller;
Part 6: a brief description of the task executor.

First of all, let's talk about the design of the monitoring system.

The basic purpose of front-end monitoring

The basic purposes of front-end monitoring in our opinion are as follows:

Usage of the applications we build: whether users actually use them and how many do;
What problems users run into while using them;
How developers and operators can track down, locate, and fix those problems in time, and learn from them so the same mistakes are not repeated;
How tracking data can feed the business: what operations and product staff can learn from it to improve product quality.

Development history

Guided by this thinking, Xiaocai's front-end team has gradually improved its monitoring system over the past five years. The following is a rough history of how it evolved:

R&D cost is the root cause behind this evolution:

In the early days, when we were short on manpower, we relied mainly on third-party tools, namely Umeng and Countly. The stack at the time was mostly RN plus native Android. Around 2015 RN was still a fairly new technology and there was no mature tracking and monitoring tool for it on the market, so we patched the third-party SDKs ourselves to make them fit.
In the second stage, after all our apps had switched to RN, we abandoned that awkward first-stage setup and moved to a simpler, partly self-built solution. Tracking and monitoring were separated: monitoring used Bugsnag, which has restrictions on data synchronization, and the reported data was processed together with the back-end team, with MongoDB as the storage. Because the data structures were flexible and the volume was large, data processing became a bottleneck.
In the third stage, as the business grew quickly, our front-end applications were no longer limited to the RN app; we also had WeChat mini programs and a large number of PC/H5 applications, and the H5 apps had gradually overtaken the RN app in both number and release frequency. So we started building the current multi-platform tracking and monitoring system. It now covers all front-end applications and part of the back-end applications, and has been in use since its launch in October 2019.

Next, let's consider how to design such a system. The system consists of two parts, tracking and monitoring; since this topic is about monitoring, here are the basic modules of the monitoring system:

Collection: how data is collected, on which platforms, and which data;
Storage: how the reported data is structured and stored;
Alarming: how the alarm system is designed, how errors are detected, and how the owner is notified;
Exception management: how reported exceptions are classified and managed;
Display: how exceptions are summarized and presented to the user.

System architecture

The following is the current architecture of the system:

The client

Currently the client SDKs cover PC/H5, RN applications, and mini programs. Node applications are rarely used in our business, so, weighing the cost against the benefit, we have not built an SDK for them.

Log processing

Log processing goes through three layers:

The first layer expects heavy traffic, so a cluster of nodes is used to spread the load; it also does the initial processing of the data, described in a later section;
The second layer is mainly a Kafka cluster used as a buffer: it reduces the pressure of writing logs into ES and also caches data so that nothing is lost if ES goes down; Filebeat serves as a backup in case Kafka itself has problems;
After processing, the raw data is stored in Elasticsearch.

Data processing

Here is the third layer of logging:

Tracking data from the clients is processed and then stored in the data warehouse; that part is handled by the back-end team.
Error data collected from the front end is processed by the monitoring system's own backend, which is covered later.

Data display

Data display has two parts. One is the presentation of tracking data, which is processed by the back-end team, stored in the data warehouse, and then displayed by our front-end visual reporting system. The other is the display of monitoring error data, which is handled by the monitoring dashboard of the monitoring system. The data flow between the modules of the monitoring system is shown below, covering the whole path from data reporting to display.

SDK implementation

The following is a brief description of the SDK implementation.

What data to collect

The first consideration is what data to collect.

Although the goal is to monitor errors, analyzing the cause of an error or reproducing the scene where it occurred often requires user-behavior data as well, so the collected data covers two aspects:

User behavior data:

Page interactions: clicks, swipes, etc.;
Page navigation: SPA/MPA route changes, app and mini-program page switches, etc.;
Network requests;
Console output: console.log/error, Android logs, etc.;
Custom events reported by developers.

Error data

Front-end JS errors: syntax errors, compatibility errors, and so on;
Native app errors.

Due to time constraints, here is only a brief description of how the SDK is implemented on two of the platforms:

RN SDK

Xiaocai's RN app is a relatively pure RN application, so the RN SDK implementation can simply be split into two sides:

JS side

Error capture: RN already provides a handy API, ErrorUtils.setGlobalHandler, similar to the browser's window.onerror, for capturing JS runtime errors; promise rejection-tracking plays the role of the browser's unhandledrejection event; we also expose a custom error-reporting API, so developers can try/catch in their own code and report errors explicitly.
Network requests: we hook send/open/onload.
Page navigation: the routing layer of our RN apps is a thin wrapper around the third-party library react-navigation, so screen tracking only needs to hook onStateChange or the Redux integration.
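As a rough sketch (not the actual SDK code), the global error hook and the screen tracking could look like this; reportError and reportScreenView are hypothetical stand-ins for the SDK's real reporting calls:

import type { NavigationState } from "@react-navigation/native";

// Hypothetical reporting helpers standing in for the SDK's real calls
declare function reportError(payload: object): void;
declare function reportScreenView(routeName: string): void;

// Keep RN's default handler so the red screen still shows in development
const defaultHandler = ErrorUtils.getGlobalHandler();
ErrorUtils.setGlobalHandler((error: any, isFatal?: boolean) => {
  reportError({ message: String(error?.message ?? error), stack: error?.stack, isFatal });
  defaultHandler(error, isFatal);
});

// Passed as <NavigationContainer onStateChange={handleStateChange}> for screen tracking
export function handleStateChange(state?: NavigationState) {
  if (!state) return;
  // Nested navigators would need to walk down to the active child route
  reportScreenView(state.routes[state.index].name);
}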

Native side

On iOS we use KSCrash to collect crash logs. The captured data (from both the JS side and the native side) is symbolicated locally before being reported, so that it can be displayed in a unified way.

Here is the promise rejection-tracking code:

const tracking = require("promise/setimmediate/rejection-tracking");
tracking.disable();
tracking.enable({
  allRejections: true,
  onHandled: () => {
    // We do nothing
  },
  onUnhandled: (id: any, error: any) => {
    const stack = computeStackTrace(error);
    stack.mechanism = 'unhandledrejection';
    stack.data = { id }
    // tslint:disable-next-line: strict-type-predicates
    if (!stack.stack) {
      stack.stack = [];
    }
    Col.trackError(stringifyMsg(stack))
  }
});

WeChat mini program SDK

Here is a brief look at two aspects of the WeChat mini program SDK implementation:

Network request

Proxy the wx.request method on the global wx object:

import "miniprogram-api-typings"

export const wrapRequest = () => {
  const originRequest = wx.request;
  wx.request = function (...args: any[]): WechatMiniprogram.RequestTask {
    // Get and report the request data here
    return originRequest.apply(this, args);
  };
};

Page display

Overrides the Page/Component object and proxies its lifecycle methods:

/* global Page Component */

function captureOnLoad(args) {
  console.log('do what you want to do', args)
}

function captureOnShow(args) {
  console.log('do what you want to do', args)
}

function colProxy(target, method, customMethod) {
  const originMethod = target[method]
  if (target[method]) {
    target[method] = function () {
      customMethod(...arguments)
      originMethod.apply(this, arguments)
    }
  }
}

// Page: lifecycle hooks sit directly on the options object
const originalPage = Page
Page = function (opt) {
  colProxy(opt, 'onLoad', captureOnLoad)
  colProxy(opt, 'onShow', captureOnShow)
  originalPage.apply(this, arguments)
}

// Component: the proxied methods sit under opt.methods
const originalComponent = Component
Component = function (opt) {
  colProxy(opt.methods, 'onLoad', captureOnLoad)
  colProxy(opt.methods, 'onShow', captureOnShow)
  originalComponent.apply(this, arguments)
}

Next, here is how we handle the logs:

The basic structure of the log processing module (called log-transfer) will be shown first.

It is the first layer that processes the reported raw data and has the following responsibilities:

Multi-node deployment: since traffic may be large, multiple nodes (each containerized) share the load;
Filebeat on every node as a backup: subsequent processing depends on Kafka, so Filebeat is kept as a fallback in case Kafka has problems;
Decrypting the reported data and verifying that it is well formed;
Normalizing some data fields to avoid data loss, as discussed later;
Adding fields that the client SDK cannot provide, such as the client IP;
Forwarding the processed data onward.
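As a rough illustration (not the actual log-transfer code), a single node could look like the following Express plus kafkajs sketch; the endpoint path, topic name, and the decrypt/isValidLog helpers are assumptions made up for the example:

import express from "express";
import { Kafka } from "kafkajs";

// Hypothetical helpers; the real system defines the encryption and the schema check
declare function decrypt(body: string): Record<string, any>;
declare function isValidLog(log: Record<string, any>): boolean;

const kafka = new Kafka({ clientId: "log-transfer", brokers: ["kafka:9092"] });
const producer = kafka.producer();

const app = express();
app.use(express.text({ type: "*/*" }));

// Receive a reported log, verify and enrich it, then forward it to Kafka
app.post("/collect", async (req, res) => {
  try {
    const log = decrypt(req.body);
    if (!isValidLog(log)) {
      res.status(400).end();
      return;
    }
    // Add fields the client cannot know, e.g. the client IP
    log.ip = req.headers["x-forwarded-for"] || req.socket.remoteAddress;
    await producer.send({
      topic: "monitor-logs",
      messages: [{ value: JSON.stringify(log) }],
    });
    res.status(204).end();
  } catch (err) {
    // The real node would fall back to Filebeat (local files) if Kafka is unavailable
    res.status(500).end();
  }
});

producer.connect().then(() => app.listen(3000));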

There are quite a few details to watch out for in the log reporting process; here are a few of them:

Because the data volume is large, all the data cannot be written into a single index; instead indices are created by hour or by day. Because of this, we need a fixed index template, which means the data type of each field must be consistent, otherwise writes will fail. The following is an example of getting this wrong:

The index template presumes that all reports share a unified data structure, but since the platforms differ it is impossible to make every reported field identical. So there are fixed fields and variable fields (that is, type-specific fields), and deciding which fields are fixed and which may vary is a design decision for the system: the schema has to stay moderately flexible:
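For illustration, a daily index pattern with a template along these lines would keep the common fields strictly typed while tucking type-specific data into a single text field; the index name, field names, and the choice of dynamic: false are assumptions for the example, not our exact template:

// Applied once through the Elasticsearch index-template API; daily indices
// then match the pattern, e.g. monitor-log-2020.05.30
const monitorLogTemplate = {
  index_patterns: ["monitor-log-*"],
  mappings: {
    // Ignore fields not declared below instead of mapping them dynamically,
    // so a noisy payload cannot explode the number of fields
    dynamic: false,
    properties: {
      app_id:    { type: "keyword" },
      log_type:  { type: "keyword" },  // e.g. jsError / request / pv
      timestamp: { type: "date" },
      ip:        { type: "ip" },
      message:   { type: "text" },
      // Type-specific fields are serialized into one string field
      extra:     { type: "text" },
    },
  },
};

export default monitorLogTemplate;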

If the report is JSON and is stored in ES as JSON, then even with an index template, any field not covered by the template creates a new mapping when it is reported, and the number of fields can explode; for example, a symbolicated iOS native error produces a large number of fields when reported. Handling these extra fields is one of the jobs of log-transfer discussed in this section.

Next, a brief look at the monitoring dashboard.

Display

Below are a few simple screenshots of the monitoring dashboard.

Role

The functions of the dashboard include:

Real-time PV/UV views;
Real-time error log views;
Issue management;
Viewing and editing alarm tasks, and so on.

What is an Issue? It is an abstraction produced by aggregating reported errors of the same type, so that we can track and handle that type of error as a unit.

An Issue has a well-defined life cycle; here is the Issue life cycle we use:

An Issue that does not need handling can be set to the ignored state. If an assigned Issue cannot be handled by its assignee, it can be transferred to someone else. A resolved Issue is not closed immediately: after the fix is released, the system verifies it online, and if the Issue recurs it is reopened.
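As a sketch of the idea (the exact states and fields in our system may differ), the Issue life cycle can be modeled roughly like this:

// Hypothetical model of an Issue and its allowed state transitions
type IssueStatus = "open" | "assigned" | "ignored" | "resolved" | "reopened" | "closed";

interface Issue {
  id: string;
  fingerprint: string;        // key used to group errors of the same type
  assignee?: string;
  status: IssueStatus;
  statusHistory: { status: IssueStatus; at: Date; by?: string }[];
}

const transitions: Record<IssueStatus, IssueStatus[]> = {
  open:     ["assigned", "ignored"],
  assigned: ["assigned", "ignored", "resolved"],  // transferring is assigned -> assigned
  ignored:  ["open"],
  resolved: ["reopened", "closed"],               // verified online before closing
  reopened: ["assigned", "resolved"],
  closed:   ["reopened"],
};

export function canTransition(from: IssueStatus, to: IssueStatus): boolean {
  return transitions[from].includes(to);
}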

Here is an example of an error detail page; the stack trace, restored with the source map, is also shown in the error details.

Alarm control

Next, let's talk about how we designed and built the alarm control module.

Structure and role

Structure

The first is the system structure:

As you can see, the alarm control module (which we call the Controller) does not communicate with the other modules directly, but through Kafka: alarm tasks are edited in the monitoring dashboard and consumed by the Controller, and the Controller drives the Inspector through Kafka to execute the alarm tasks.
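A rough sketch of that flow with kafkajs (the topic names and task shape are assumptions for illustration, not our exact implementation):

import { Kafka } from "kafkajs";

// Assumed task shape; see the example data structure later in the article
interface AlarmTask { id: string; intervalMs: number; [key: string]: any }

const kafka = new Kafka({ clientId: "alarm-controller", brokers: ["kafka:9092"] });
const consumer = kafka.consumer({ groupId: "alarm-controller" });
const producer = kafka.producer();

const tasks = new Map<string, AlarmTask>();
const lastRun = new Map<string, number>();

async function start() {
  await Promise.all([consumer.connect(), producer.connect()]);
  await consumer.subscribe({ topic: "alarm-task-edits" });

  // Keep the local task list in sync with edits made in the dashboard
  await consumer.run({
    eachMessage: async ({ message }) => {
      const task: AlarmTask = JSON.parse(message.value!.toString());
      tasks.set(task.id, task);
    },
  });

  // Periodically hand due tasks to the Inspector, again through Kafka
  setInterval(async () => {
    const now = Date.now();
    for (const task of tasks.values()) {
      if (now - (lastRun.get(task.id) ?? 0) < task.intervalMs) continue;
      lastRun.set(task.id, now);
      await producer.send({
        topic: "alarm-task-dispatch",
        messages: [{ value: JSON.stringify(task) }],
      });
    }
  }, 10_000);
}

start();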

Role

We said earlier that an "Issue" is an abstraction over errors that share the same key characteristics, but not how that abstraction is made. Here is a brief explanation, taking JS errors as an example:

JS errors are normalized with TraceKit when they are reported, so every error has a uniform structure. Errors are then classified by the following fields:

Error type: TypeError, SyntaxError, etc.;
Error message;
Capture mechanism: onerror / unhandledrejection;
Stack information.
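A sketch of how a grouping key could be computed from the normalized error (the field names loosely follow TraceKit's report shape; the exact fields and hashing used in our system are not shown here):

import { createHash } from "crypto";

// Loosely follows the shape TraceKit produces; simplified for illustration
interface NormalizedError {
  name: string;                                 // e.g. "TypeError"
  message: string;
  mechanism: "onerror" | "unhandledrejection";
  stack: { url: string; func: string; line: number }[];
}

// Errors that produce the same fingerprint are grouped into one Issue
export function issueFingerprint(err: NormalizedError): string {
  const topFrame = err.stack[0];
  const key = [
    err.name,
    err.mechanism,
    err.message,
    topFrame ? `${topFrame.url}:${topFrame.func}` : "",
  ].join("|");
  return createHash("md5").update(key).digest("hex");
}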

Here is how Issues are displayed in the system; every status change of an Issue is recorded.

The role of Kafka

The main functions of Kafka include:

Storing and distributing the task queue;
Communication between the modules of the system.

Why not use real-time communication (socket.io / RPC)? Mainly for system stability:

Kafka buffers the messages and relieves pressure on the system. Compared with real-time communication, alarm tasks delivered through Kafka are more stable and their arrival is better guaranteed: if a single-node service such as the task controller goes down, it can still consume the messages buffered in Kafka once it comes back up. There are, of course, alternatives to Kafka, such as Redis or RabbitMQ.

Alarm task design

The design of the alarm task is a fairly important part of the alarm system; a task is roughly abstracted into two parts:

The execution rule (in other words, how often the task runs);
The judgment rule (in other words, what triggers the alarm): whether the task consists of multiple conditions and how those conditions relate to each other; how each condition is queried (an ES query, a SQL query, or something else); and how the query result is evaluated.

Example data structure

Here is an example of a simple alarm task data structure:
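As a hedged illustration of what such a structure might contain (the field names here are invented for the example, not our exact schema):

interface AlarmCondition {
  // How the data is queried: an ES query, a SQL statement, or another source
  queryType: "es" | "sql";
  query: object | string;
  // How the query result is evaluated, e.g. error count > 50 within the window
  metric: "count" | "avg" | "p95";
  operator: ">" | ">=" | "<" | "<=";
  threshold: number;
}

interface AlarmTask {
  id: string;
  name: string;
  appId: string;
  // Execution rule: how often the Inspector runs this task
  intervalMinutes: number;
  // Judgment rule: one or more conditions and how they combine
  conditions: AlarmCondition[];
  conditionRelation: "and" | "or";
  // Who gets notified, and through which channels
  receivers: string[];
  notifyChannels: ("dingtalk" | "email" | "sms")[];
  enabled: boolean;
}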

The alarm task can be manually set and controlled

Task execution

Finally, the last module in the system: the task execution module (which we call the Inspector).

The task executor is relatively simple. It executes the alarm tasks dispatched by the Controller, queries the reported online data, evaluates the task's judgment rules to produce a result, and returns that result to the Controller through Kafka. Since there may be many alarm tasks, it is deployed as a cluster, with multiple nodes sharing the load:
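A minimal sketch of one Inspector node, again with assumed topic names, task shape, and a stubbed-out query function:

import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "inspector", brokers: ["kafka:9092"] });
// Nodes in the same consumer group share the dispatched tasks between them
const consumer = kafka.consumer({ groupId: "inspector" });
const producer = kafka.producer();

// Stub standing in for the real Elasticsearch query
async function runQuery(task: { query: object }): Promise<number> {
  return 0; // e.g. the error count within the task's time window
}

async function start() {
  await Promise.all([consumer.connect(), producer.connect()]);
  await consumer.subscribe({ topic: "alarm-task-dispatch" });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const task = JSON.parse(message.value!.toString());
      const value = await runQuery(task);
      // Evaluate the judgment rule and send the result back to the Controller
      const triggered = value > task.threshold;
      await producer.send({
        topic: "alarm-task-results",
        messages: [{ value: JSON.stringify({ taskId: task.id, value, triggered }) }],
      });
    },
  });
}

start();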

Conclusion

Finally, here again is the data flow diagram between the modules that was shown earlier; hopefully it is clear by now:

The main modules and data flows of the entire system are all in this diagram.

This is the end of my speech, thank you!