



The full text follows.

There was a text version of this share last year, written before the system was fully built; you can read the two together: Technology exploration: 60 days of rapid self-development – the front-end tracking and monitoring system "Prodigal Son".

This article is the brief lecture version of the fifth session's talk on building a front-end monitoring system, shared by lecturer Jimmy (see the video for the full version):

Preface

It has been about five years since Song Xiaocai's first app was launched, and our terminal applications have gradually gone from a single RN app to a mix of RN app, PC, H5, mini program, and other types. We began developing our own tracking and monitoring system in 2019; it first went live at the end of that year and has been running ever since. Here is how we built the current system, covered from the following aspects:





  • Part one: why we built this system ourselves, and the ideas behind designing and developing it;
  • Part two: a brief look at the SDK implementations;
  • Part three: how reported logs are processed, and the pitfalls we hit;
  • Part four: a brief introduction to and demonstration of the monitoring dashboard;
  • Part five: the design of our alarm controller;
  • Part six: a brief look at the task executor.

Design ideas

First of all, let’s talk about how we designed this monitoring system.

Basic purpose of front-end monitoring




In our view, the basic purposes of front-end monitoring are the following:

  • Usage of the applications we build: is anyone using them, and how many users are there;
  • What problems users run into while using them;
  • How we, as developers and operators, track and address those problems;
  • And how we learn from them so the same mistakes are not repeated;
  • Tracking data feeding the business: what operations and product owners can extract from it to improve product quality.


Development history

Based on the above thinking about the monitoring system, the Xiaocai front-end team gradually improved its own monitoring system over the past five years. This is roughly a brief history of it:



Consideration of R&D cost is the root cause of the development history above:

  • In the early stage, short on manpower, we relied on third-party tools, namely Umeng and Countly. The stack at the time was mainly RN plus native Android applications. Since RN was still a fairly new technology in 2015, there was no mature RN tracking/monitoring tool on the market, so we patched the third-party SDKs to get what we needed;
  • In the second stage, once every app had switched to RN development, we abandoned the first stage's makeshift approach for a simpler, partly self-developed plan. Tracking and monitoring were separate at that point: Bugsnag handled monitoring, but it restricts data synchronization. Processing of the reported data was done together with back-end colleagues, with MongoDB as the storage medium; its loose schema and the large volume of stored data later became a data-processing bottleneck;
  • In the third stage, as the business grew rapidly, our front-end applications were no longer limited to the RN app but also included WeChat mini programs and a large number of PC/H5 applications, whose count and release frequency gradually exceeded those of the RN app. So we began developing the current multi-terminal tracking and monitoring system. It now covers all front-end applications and some back-end applications, and has been in use since it went live in October 2019.

Basic modules

Next, consider how to design the system. It consists of two parts: tracking and monitoring. Since this talk's topic is monitoring, only the basic modules of the monitoring system are discussed here:



  • Collection module: what data to collect, and from which ends;
  • Storage: how reported data should be structured and saved;
  • Alarm: how the alarm system should sniff out errors and notify the responsible person;
  • Exception management: how to classify and manage reported exceptions;
  • Display: summarizing exceptions and presenting them to users.
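To make the storage module concrete, a single reported log entry might carry roughly the fields below. This is a hypothetical sketch, not our exact production schema; every field name here is our own illustration:

```javascript
// Hypothetical shape of one reported log entry (illustrative only).
// A real schema would be dictated by the index template discussed later.
function createLogEntry({ type, appId, payload }) {
  return {
    type,                   // 'behavior' | 'error' | 'custom'
    appId,                  // which application reported this entry
    timestamp: Date.now(),  // client-side report time
    sdkVersion: '1.0.0',    // useful when debugging the SDK itself
    payload,                // type-specific fields (error stack, page path, ...)
  };
}

const entry = createLogEntry({
  type: 'error',
  appId: 'h5-demo',
  payload: { message: 'x is not defined' },
});
```

Fields that the client cannot fill in (such as IP) would be added server-side, as described in the log processing section.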


System architecture

Here is the current infrastructure of the system:



The client

At present the client SDKs cover PC/H5, RN applications, and mini programs. Node applications are rarely used in our business, so, considering the input-output ratio, we have not built an SDK for them.

Log processing

Log processing goes through three layers:

  • The first layer: since traffic is large, a cluster spreads the load and the data gets its first round of processing, discussed in later sections;
  • The second layer: a Kafka cluster acts as a buffer to reduce the pressure of writing logs to ES, and also caches data to avoid loss if ES goes down; Filebeat serves as a backup in case Kafka has problems;
  • The third layer: the raw data is processed and stored in Elasticsearch.

Data processing

Here is the third layer of log processing:

  • The on-device tracking data is processed and stored in the data warehouse; this is the back-end colleagues' work;
  • The error data collected from the front end is processed in the monitoring system's backend, discussed later.

Data display

  • First, the display of tracking data: after back-end processing it is stored in the data warehouse, then displayed by our front-end visual reporting system;
  • Second, the display of monitoring error data, which is handled by the monitoring system's dashboard.


Data flow between the monitoring system's modules

The following is the whole flow of data in the monitoring system, from reporting to display:



SDK implementation

Here's a quick look at the SDK implementation:





What data to collect

The first thing to consider is what data should be collected:







Although the goal is to monitor error data, we sometimes need user behavior data to analyze the cause of an error or to reproduce the scene in which it occurred, so the data to collect covers two aspects:



User behavior data:

  • User page actions: clicks, swipes, etc.;
  • Page navigation: SPA/MPA page jumps, app/mini program page switches, etc.;
  • Network requests;
  • Console output: console.log/error or Android logs;
  • Custom event reporting.

Error data

  • Back-end interface errors:
    • Front-end failures caused by unavailable services;
    • Front-end errors caused by mishandling bad service data;
  • Front-end JS errors:
    • Syntax errors;
    • Compatibility errors, etc.;
  • App native errors, etc.
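Collapsing these varied error sources into one reportable shape is the SDK's first job. A minimal normalization sketch; the function and field names here are our own illustration, not the actual SDK API:

```javascript
// Normalize anything caught (Error object, rejected string, ...) into a flat
// report object. 'mechanism' records which hook caught it, e.g. 'onerror',
// 'unhandledrejection', or a manual 'try/catch' report.
function normalizeError(error, mechanism) {
  const e = error instanceof Error ? error : new Error(String(error));
  return {
    name: e.name,                                     // e.g. 'TypeError', 'SyntaxError'
    message: e.message,
    mechanism,
    stack: (e.stack || '').split('\n').slice(0, 10),  // cap stack depth
  };
}
```

On the web, such a function would typically be wired to `window.onerror` and the `unhandledrejection` event; on RN, to the hooks described in the next section.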

SDK implementation

In the interest of time, here is a simple look at the SDK implementations on both ends:



RN SDK







Xiaocai's RN app is a fairly pure RN application, so the RN SDK implementation can simply be split into two ends:



JS end

  • Error capture: RN already provides convenient APIs:
    • ErrorUtils.setGlobalHandler, similar to the browser's window.onerror, captures JS runtime errors;
    • promise/setimmediate/rejection-tracking, similar to the browser's unhandledrejection event;
    • Custom error reporting is also available: developers can try/catch and call the report API themselves;
  • Network requests: proxy the XMLHttpRequest send/open/onload methods;
  • Routing: our RN apps wrap the react-navigation component, so we only need screen tracking in onStateChange or in the Redux integration.
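The proxying approach used for network requests can be sketched generically: wrap a method so the original behavior is preserved while the call is observed. A minimal sketch, with `onCall` standing in for the SDK's real logging hook:

```javascript
// Generic method proxy: keep the original semantics, observe every call.
// In the real SDK, onCall would enqueue a behavior-log entry instead.
function proxyMethod(target, name, onCall) {
  const origin = target[name];
  if (typeof origin !== 'function') return;
  target[name] = function (...args) {
    onCall(name, args);               // record the call
    return origin.apply(this, args);  // original behavior preserved
  };
}
```

The same pattern underlies proxying `XMLHttpRequest.prototype.open/send` on the web and `wx.request` in mini programs.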

Native client

  • iOS uses KSCrash for log collection, which can symbolicate locally;
  • The captured data (both JS and native) is reported in a unified manner.

Promise rejection-tracking code example:

```typescript
const tracking = require("promise/setimmediate/rejection-tracking");
tracking.disable();
tracking.enable({
  allRejections: true,
  onHandled: () => {
    // We do nothing
  },
  onUnhandled: (id: any, error: any) => {
    const stack = computeStackTrace(error);
    stack.mechanism = 'unhandledrejection';
    stack.data = { id };
    // tslint:disable-next-line: strict-type-predicates
    if (!stack.stack) {
      stack.stack = [];
    }
    Col.trackError(stringifyMsg(stack));
  }
});
```

WeChat mini program SDK

Here are two aspects of the WeChat mini program SDK implementation. Network requests: proxy the global `wx.request` method:

```typescript
import "miniprogram-api-typings"

export const wrapRequest = () => {
  const originRequest = wx.request;
  wx.request = function (...args: any[]): WechatMiniprogram.RequestTask {
    // get request data
    return originRequest.apply(this, args);
  };
};
```

Page navigation: override the `Page`/`Component` objects, proxying their lifecycle methods:

```javascript
/* global Page Component */

function captureOnLoad(args) {
  console.log('do what you want to do', args)
}

function captureOnShow(args) {
  console.log('do what you want to do', args)
}

function colProxy(target, method, customMethod) {
  const originMethod = target[method]
  if (target[method]) {
    target[method] = function () {
      customMethod(...arguments)
      originMethod.apply(this, arguments)
    }
  }
}

// Page: lifecycle methods live directly on the options object
const originalPage = Page
Page = function (opt) {
  colProxy(opt, 'onLoad', captureOnLoad)
  colProxy(opt, 'onShow', captureOnShow)
  originalPage.apply(this, arguments)
}

// Component: page lifecycle methods live under opt.methods
const originalComponent = Component
Component = function (opt) {
  colProxy(opt.methods, 'onLoad', captureOnLoad)
  colProxy(opt.methods, 'onShow', captureOnShow)
  originalComponent.apply(this, arguments)
}
```

Log processing

Here’s how we do log processing:





System structure and functions

Let's first look at the basic structure of the log processing module (we call it log-transfer):



Structure

This module is the first layer that processes the raw reported data, and it has the following features:

  • Multi-node mode: since data traffic can be large, multiple nodes share the load (each node is containerized);
  • Each node has a Filebeat as backup: to keep downstream Kafka processing stable, a Filebeat was added as a fallback.

Functions

  • Decrypt the reported data and verify its integrity;
  • Process some data fields to avoid data loss (more on this later);
  • Add fields the client SDK cannot obtain, such as IP;
  • Forward the processed data.

Key points

There are many things to watch out for when processing reported logs; here are a few simple points:





  • Because of the data volume, not all data is written to a single index; indices must be created hourly or daily to hold it;
  • Because of that, a fixed index template is needed, so each field's data type must be consistent, otherwise writes will fail;
  • The index template presumes that all reports share a unified data structure, but different report types make it impossible to unify every field. There are fixed fields and variable (type-specific) fields, and deciding which is which takes careful system design, i.e. keeping the schema moderately flexible:
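Creating indices by day can be as simple as deriving the index name from the date. A sketch, assuming a `front-logs-YYYY.MM.DD` naming convention; the prefix is illustrative and would have to match the index template's pattern:

```javascript
// Derive a daily Elasticsearch index name such as 'front-logs-2020.08.29'.
// UTC is used here for determinism; a real system might index by local day.
function dailyIndexName(prefix, date) {
  const pad = (n) => String(n).padStart(2, '0');
  return `${prefix}-${date.getUTCFullYear()}.${pad(date.getUTCMonth() + 1)}.${pad(date.getUTCDate())}`;
}
```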



  • If JSON data is reported and stored as JSON in ES, then even with an index template, fields the template does not cover will create new mappings when they arrive, causing an explosion in the number of fields. For example, a symbolicated iOS native error generates many fields in its report. Handling these extra fields is one of log-transfer's jobs, as mentioned in this section.
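One common way to keep unplanned fields from exploding the mapping is to whitelist the known fields and fold everything else into a single serialized field. A sketch of the idea; the `extra` field name and the whitelist approach are our own illustration, not necessarily what log-transfer does:

```javascript
// Keep whitelisted fields as-is; serialize everything else into one 'extra'
// string, so unexpected fields never create new mappings in Elasticsearch.
function foldExtraFields(report, knownFields) {
  const out = {};
  const extra = {};
  for (const [key, value] of Object.entries(report)) {
    if (knownFields.includes(key)) out[key] = value;
    else extra[key] = value;
  }
  if (Object.keys(extra).length > 0) out.extra = JSON.stringify(extra);
  return out;
}
```

The folded `extra` string can still be inspected per-document, just not queried as individual fields.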


Monitoring dashboard

Here's a quick look at the monitoring dashboard:



Display

Below are some simple screenshots of the monitoring dashboard:



Functions






The dashboard's functions include:

  • Real-time PV/UV viewing;
  • Real-time error log viewing;
  • Issue management;
  • Viewing and editing alarm tasks.

Issue handling process

What is an issue? It is the summary and abstraction of reported errors of the same type, so that errors of one type can be tracked and handled as a unit.







An issue has a clear life cycle. Here is ours:

  • Issues that need no handling can be set directly to the ignored state;
  • If an assignee cannot handle an issue, it can be transferred to someone else;
  • A processed issue is not closed immediately: after release it is verified online, and if the error recurs the issue is reopened.
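The life cycle above can be sketched as a small state machine. The state names and allowed transitions below are our own approximation of the flow just described, not the system's literal status values:

```javascript
// Approximate issue life cycle: which status changes are legal from each state.
const transitions = {
  open:     ['ignored', 'assigned'],
  assigned: ['assigned', 'resolved', 'ignored'], // re-assignment to someone else is allowed
  resolved: ['closed', 'reopened'],              // verified online before closing
  reopened: ['assigned'],
  closed:   ['reopened'],                        // recurrence reopens the issue
  ignored:  [],
};

function canTransition(from, to) {
  return (transitions[from] || []).includes(to);
}
```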


Here is an example of error details:







The error details also show the stack trace converted using the source map.

Alarm control

Let’s talk about how we design and develop the alarm control module:



Structure and function

Structure



The first is the system architecture:





  • As you can see, the alarm control module (we call it the Controller) does not communicate with other modules directly, but through Kafka;
  • Alarm tasks are edited on the monitoring dashboard and consumed by the alarm control module;
  • The alarm control module drives the Inspector through Kafka to execute alarm tasks.


Functions





Key points

Characteristics of error messages



We explained earlier what an issue is, but not how one is abstracted. Here is a simple explanation, taking JS errors as an example:



When reporting JS errors we use TraceKit to standardize them, which gives every error a uniform structure; errors are then judged and classified against it. The reference fields include:

  • Error type: e.g. TypeError, SyntaxError;
  • Error message;
  • Trigger function: onerror/unhandledrejection;
  • Stack trace.

Here is a demo of an issue in the system:

Every status update of the issue is recorded.

The role of Kafka







The main functions of Kafka include:

  • Task queue storage and distribution;
  • Communication between system modules.

Why not use real-time communication (socket.io/RPC)? Broadly, for system stability:

  • Kafka caches communication messages, relieving pressure on the system;
  • Compared with real-time communication, Kafka is more reliable for alarm tasks and delivery is guaranteed: if a single-node service such as the task controller goes down, messages already cached in Kafka can still be consumed once the service restarts.

Of course, there are alternatives to Kafka, such as Redis/RabbitMQ.



Alarm task design







Alarm task design is an important part of the alarm system; a task is roughly abstracted into two parts:

  • Task execution rules (i.e. execution frequency);
  • Task judgment rules (i.e. the rules that trigger an alarm):
    • Judgment rules: which conditions make up an alarm task, and how multiple conditions combine;
    • Query type: an ES query, a SQL query, or something else;
    • Rules for computing on the query results.

The following is a simple example of an alarm task data structure:
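As a hedged illustration of the two-part abstraction above, such a task might look like the object below; every field name is our own invention, not the production schema:

```javascript
// Hypothetical alarm task: an execution rule (schedule), a query, and a
// judgment rule (trigger) deciding when the alarm fires.
const alarmTask = {
  id: 'task-js-error-rate',
  schedule: { interval: '5m' },            // execution rule: run every 5 minutes
  query: {
    type: 'es',                            // ES query vs SQL vs something else
    body: { match: { type: 'jsError' } },  // what to count
    window: '5m',                          // time range of each inspection
  },
  trigger: {
    relation: 'and',                       // how multiple conditions combine
    conditions: [{ metric: 'count', op: '>', threshold: 100 }],
  },
  notify: { channel: 'dingtalk', receivers: ['owner'] },
};
```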









Alarm tasks can be set up and controlled manually.

Task execution


Finally, the last module in the system: the task execution module (we call it the Inspector):



The task executor is relatively simple. It executes the alarm tasks distributed by the Controller: it queries the reported online data, generates task results according to each task's judgment rules, and returns them to the Controller through Kafka. Since there may be many alarm tasks, it is deployed as a cluster so the load is spread across multiple nodes:
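A single inspection round can be sketched as: run the task's query, apply the judgment rule, and hand the result back. In the sketch below, `runQuery` stands in for a real Elasticsearch client call, and the object shapes are illustrative only:

```javascript
// One inspection round for one alarm task. The returned result would be
// published back to the Controller via Kafka in the real system.
async function inspect(task, runQuery) {
  const count = await runQuery(task.query);        // e.g. an ES count query
  const ops = { '>': (a, b) => a > b, '<': (a, b) => a < b };
  const cond = task.trigger.conditions[0];         // single-condition sketch
  const triggered = ops[cond.op](count, cond.threshold);
  return { taskId: task.id, count, triggered };
}
```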

Conclusion






Finally, here is the data flow chart between the modules described earlier; hopefully it ties things together:





The main modules and data flows of the entire system are shown in this diagram.

This is the end of my speech, thank you!


