I. Background

Real-time monitoring large screens are a classic big data application and are widely used at Internet companies. Well-known examples include Taobao's real-time sales screen for Double 11 and Didi's geographic distribution of users. Good Future's business systems likewise have many core scenarios that need unified monitoring through a real-time large screen. This document describes the real-time monitoring large-screen solution provided by Future Cloud – Service Monitoring.

II. Key Points

Building a real-time large screen involves both technical work and product work. The key points and difficulties in the project are:

1. Technology

  • Second-level real-time indicator calculation over massive logs
  • Guaranteeing computation when log traffic is tidal
  • Storage selection for different types of indicators

2. Product

  • Which services require a real-time large screen
  • What metrics need to be displayed on the big screen
  • How do you define the value of a real-time large screen

III. Technical Architecture

1. Detailed architecture

  • Source layer: contains all core logs and service data associated with the target services, including:
    • Client logs: user tracking (instrumentation) events, user actions, network requests, crashes and exceptions, etc.
    • Link logs: logs of the key steps along the request and invocation path, including DCDN, origin server, and server-side traces.
    • Service logs: key logs of business events, mapped to the traces of key business actions.
    • Business data: multi-dimensional basic business information in structured databases, such as users, courses, and orders.
  • Transport layer: relies mainly on the collection channel of the Future Cloud log center to collect and distribute all source-layer data in a unified manner. Each source must be registered on the log center platform so that collection and delivery meet the metadata standards.
  • Processing layer: processes data on distributed real-time computing engines. Tasks currently run on a Flink cluster and a Spark cluster:
    • Flink: a stream processing engine with low latency and SQL support. It has two deployment modes: on K8s and on Yarn.
    • Spark: a batch processing engine used in micro-batch mode for tasks whose latency requirements are less strict. It supports SQL and is currently deployed on Yarn.
  • Storage layer: mainly supports real-time OLAP scenarios and scenarios with large tidal traffic. After evaluation, we built it on the following two components:
    • ClickHouse: a column-oriented DBMS with a distributed architecture, efficient real-time data ingestion, SQL support, and rich functions, but no support for transactions or in-place updates.
    • Redis: an in-memory database with excellent performance and atomic operations. It can serve reads and writes for second-level and millisecond-level scenarios. It does not support SQL, so the cost of data modeling is slightly higher. We currently run it as a Twemproxy-based cluster, mainly as a cache and to serve high-volume bursts of real-time indicator requests.
  • Display layer: the back-end service interacts with the storage layer, assembles the data, and exchanges it with the front end for display. The main front-end interaction protocols are:
    • HTTP: client-initiated requests, with refresh intervals of a minute or longer. Suitable for scenarios with low request volume and frequency.
    • WebSocket: server push, with refresh intervals of a second or less, such as per-second sales volume, per-second online user count, and second-level alarm push.
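To make the processing layer's second-level indicators concrete, here is a minimal sketch of a one-second tumbling-window aggregation written in plain Python rather than Flink or Spark. The `LogEvent` fields and the output column names are assumptions made for illustration; they are not the schema used by the log center.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogEvent:
    ts_ms: int          # event time in epoch milliseconds (hypothetical field)
    service: str        # service that emitted the log (hypothetical field)
    req_time_ms: float  # request duration in milliseconds
    is_error: bool      # whether the request failed


def aggregate_per_second(events):
    """Group events into 1-second tumbling windows per service."""
    buckets = defaultdict(list)
    for e in events:
        window_start = e.ts_ms // 1000  # epoch second the event falls in
        buckets[(window_start, e.service)].append(e)

    rows = []
    for (second, service), group in sorted(buckets.items(), key=lambda kv: kv[0]):
        errors = sum(1 for e in group if e.is_error)
        rows.append({
            "second": second,
            "service": service,
            "qps": len(group),  # events counted in this 1-second window
            "avg_req_time_ms": sum(e.req_time_ms for e in group) / len(group),
            "error_rate": errors / len(group),
        })
    return rows
```

In a real job the engine's windowing operators would do this incrementally and handle late data; the sketch only shows the shape of the per-second rows that the storage layer would then receive.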

2. Experience sharing

  • Real-time stability assurance for tidal scenarios

    Most of the businesses we serve have tidal traffic. Take online schools as an example: the daily peak of live-class attendance is concentrated in fixed time periods during the day, and log volume when no courses are running differs greatly from log volume at peak. Log volume also cannot be estimated as a simple linear function of business data. In this situation, the stability of computing tasks can be improved in the following ways:

    1. Stress testing

      Stress testing a real-time computing task differs from stress testing a business service. In stream computing, the real-time QPS, the size of a single record, and the types of key fields all affect task stability. Purely synthetic logs may not expose the real bottlenecks and problems of a task. We therefore replay real online logs from peak hours and, based on the task's scenario, construct reasonable cases with various kinds of data skew. Besides testing in off-peak periods, we also, within a controllable scope, run the task during business peaks with the minimum unit of computing resources, to observe its performance and check whether it matches the stress-test results.
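When choosing which peak-hour logs to replay, it helps to quantify how skewed a key field is. The following is a rough sketch under stated assumptions: records are dictionaries, and `key_field` (for example a hypothetical `room_id`) is the field the task groups by. The two ratios are illustrative measures, not the team's own metric.

```python
from collections import Counter


def key_skew(records, key_field):
    """Return (hottest_key, share_of_records, max_to_mean_ratio) for key_field."""
    counts = Counter(r[key_field] for r in records)
    if not counts:
        raise ValueError("records is empty")
    total = sum(counts.values())
    hottest_key, hottest_count = counts.most_common(1)[0]
    mean_count = total / len(counts)  # average records per distinct key
    return hottest_key, hottest_count / total, hottest_count / mean_count
```

A max-to-mean ratio close to 1 means keys are evenly spread; a large ratio suggests that one partition or subtask would receive a disproportionate share of the load during the test.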

    2. Task scheduling

      When business traffic is small, most of our real-time tasks stay healthy. When business log peaks arrive and overlap, problems often appear, and at that point attention to task logic alone is not enough: the physical deployment environment also matters. We should minimize each task's network overhead and its communication with upstream and downstream components, keep the data flow in a closed loop within a reasonable set of physical machine rooms, and avoid long-distance network transfers. Once tasks pull data across rooms, they share the dedicated-line bandwidth between rooms with other traffic, which can easily cause cascading problems and lead to larger incidents.
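One way to review a deployment for cross-room traffic is to list every upstream/downstream pair whose endpoints sit in different machine rooms. This is a minimal sketch; the component names and room labels in the usage note are hypothetical, and in practice the placement data would come from the scheduler or a CMDB.

```python
def cross_room_edges(edges, room_of):
    """Return the (upstream, downstream) pairs whose endpoints are in different rooms."""
    return [
        (src, dst)
        for src, dst in edges
        if room_of[src] != room_of[dst]  # data would cross the inter-room line
    ]
```

For example, with edges `[("kafka-live", "flink-qps"), ("flink-qps", "clickhouse")]` and a placement that puts only `clickhouse` in another room, the function reports the second edge as the one using inter-room bandwidth.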

    3. Resource scale

      Even if stress testing is thorough and task deployment is reasonable, there will always be cases where the “estimated” peak is exceeded. Here we can rely on the capabilities of the computing engine layer. With the dynamic resource mechanism of Spark on Yarn, tasks can scale out on their own within their queue's resources. We are also testing Flink's native mode on K8s to achieve dynamic resource scaling. On top of this, we still reserve a resource buffer for peaks as redundancy and do not rely entirely on scaling out to absorb them. Provisioning the necessary resources in advance remains indispensable.
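The combination of a reserved buffer and dynamic scaling can be sketched as a small sizing helper. The arithmetic and parameter names (`qps_per_executor`, `buffer_ratio`, `queue_max_executors`) are assumptions for illustration. The property names are standard Spark dynamic allocation settings, but whether the team sets exactly these is not stated in the article.

```python
import math


def plan_executors(estimated_peak_qps, qps_per_executor, buffer_ratio, queue_max_executors):
    """Size the always-reserved executors and the scaling ceiling for one task."""
    # Executors needed to carry the estimated peak, taken from stress-test throughput.
    baseline = math.ceil(estimated_peak_qps / qps_per_executor)
    # Reserve the buffer up front instead of relying only on scale-out.
    reserved = math.ceil(baseline * (1 + buffer_ratio))
    if reserved > queue_max_executors:
        raise ValueError("queue cannot hold the reserved capacity")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # External shuffle service, commonly required for dynamic allocation on Yarn.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(reserved),
        "spark.dynamicAllocation.maxExecutors": str(queue_max_executors),
    }
```

With an estimated peak of 50,000 QPS, 4,000 QPS per executor, a 30% buffer, and a queue limit of 40, the helper reserves 17 executors and lets the task grow to 40 if the estimate is exceeded.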

  • Storage selection of different indicators

    A real-time large screen shows indicators across many dimensions and with different timeliness requirements. Each kind of indicator needs a suitable storage choice:

    1. Second-level indicators: ClickHouse and Redis both have enough write performance to support second-level metrics such as per-second QPS, reqTime, and error rates. On the read side, however, we distinguish by scenario. If a page shows a large amount of data through frequent, simple requests, with no complex statistics or dimension correlation, we use Redis, as in Owl monitoring. For link-correlated queries, complex algorithms, and same-period comparison analysis, we use ClickHouse, so that SQL and its rich built-in functions can support our business needs.
    2. Minute-level and above indicators: most of these are used for analysis and are stored in ClickHouse. If the data source is MySQL and its peak load is not high, we read the indicators directly from the database. If the MySQL peak load is high, we capture the binlog in real time, compute incrementally, and write the results to ClickHouse for reading.
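The selection rules above can be summarized as a small decision function. This is only a sketch of the reasoning: the `Indicator` fields and the returned labels are hypothetical names, and real decisions would also weigh data volume and retention.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    name: str
    granularity: str                   # "second" or "minute_plus"
    needs_complex_query: bool          # link correlation, heavy statistics, comparisons
    source: str = "log"                # "log" or "mysql"
    source_peak_is_high: bool = False  # only meaningful when source == "mysql"


def choose_storage(ind):
    """Pick a read path for an indicator, following the rules described above."""
    if ind.granularity == "second":
        # Frequent simple reads go to Redis; analytical reads go to ClickHouse.
        return "clickhouse" if ind.needs_complex_query else "redis"
    if ind.source == "mysql":
        # Read MySQL directly only when its peak load is low.
        return "clickhouse_via_binlog" if ind.source_peak_is_high else "mysql_direct"
    return "clickhouse"
```

For instance, a per-second online count read by many pages maps to `redis`, while an order metric sourced from a heavily loaded MySQL instance maps to `clickhouse_via_binlog`.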

IV. Product Thinking

We have also gained a lot of insight and experience from building large-screen products. This section follows the product key points listed in section II:

  • Which services require a real-time large screen

    Real-time large screens have two key characteristics:

    The first is timeliness. The data on screen changes with time, and what users expect to see on the large screen is the latest information and the latest problems. Businesses that are highly sensitive to timeliness are therefore well suited to a real-time monitoring screen. Take the live classroom: the entire user experience of a live class is concentrated in one or two hours, and during that time problems must be found quickly and accurately so that judgments and operations can follow. Renewal enrollment is similar: when the enrollment window opens, there is a short peak of requests that puts great pressure on the system. Problems directly affect users' real experience of purchasing courses and risk losing users, so real-time monitoring of the renewal process and its quality is necessary.

    The second is a high level of abstraction and condensation. A real-time screen is not positioned as a tool for analyzing root causes or troubleshooting. Its job is to capture problems quickly and present the latest information in real time. For complex core businesses that involve many modules, heavily coupled service calls, and many participating roles, its value is that it quickly improves the coordination and efficiency of large-scale operations: everyone only needs to watch the changes in their own core indicators, simple and clean. In summary, businesses that fit these two characteristics need a real-time monitoring large screen.

  • What metrics need to be displayed on the big screen

    Once it is clear that a business needs a real-time monitoring large screen, how do we decide which indicators to pick out and display? Based on practical experience, we use the following criteria:

    First, business indicators. Start by quickly understanding the business architecture: trace the core link from the user side through the service layer, component layer, and resource layer, build a clear module breakdown, and split the target business into a few core units. For business units, show the North Star metrics of the business, such as the number of orders, the number of online users, and the number of messages. For technical architecture units, show the unit's load, utilization level, and real-time quality, such as interface QPS, system capacity, and SLA indicators. Business and technical indicators can then be correlated intuitively over time to assist in problem analysis.

    Second, customer complaint indicators. Beyond the core business link, user feedback should also be considered, such as changes in indicators from the customer complaint platform and the user help platform. This lets us monitor the growth of user feedback on various problems in real time, understand the current user experience and quality intuitively, and assist in judging problems.

    Last, alerting. For real-time alerting, we configure alert rules directly on the indicators shown on the screen, in a what-you-see-is-what-you-get manner. Users split the indicators that need alerting into early-warning and alarm levels by threshold, distinguished on screen by orange and red. When an alarm is triggered, the indicator changes color and the alarm module pops up the alarm at the same time. All users who have the large screen open receive the same alarm push, and the alarms keep accumulating. After a page refresh they are no longer kept on the page, but they remain stored in the back end.
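The two-level threshold rule can be sketched as follows. The function assumes that higher values are worse (as with an error rate); the level names and the idea of returning a color alongside the level are illustrative choices, not the product's actual API.

```python
def classify(value, warning_threshold, alarm_threshold):
    """Map a metric value to (level, display_color), assuming higher is worse."""
    if value >= alarm_threshold:
        return "alarm", "red"
    if value >= warning_threshold:
        return "warning", "orange"
    return "normal", None
```

A metric where lower is worse, such as a success rate, would need the comparisons reversed or the value inverted before classification.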

  • How do you define the value of a real-time large screen

    How do we measure the value of a real-time large screen? After several business practices, we have summarized a few indicators:

    The first is the number of active online users at critical times, such as live-class peaks and renewal peaks: whether users are watching, and how many types of user roles are watching. This directly measures the value of a real-time large screen.

    The second is whether the screen helps users find problems, directly or indirectly. If an incident or problem occurs and nothing related shows up on the screen, the product has undoubtedly failed. The screen should cover the scenarios where “trouble” happens and guide users to drill into the details. If, through real-time alarms and early warnings, the large screen keeps helping users find real problems in time, it is undoubtedly valuable.

V. Business Cases

1. Live classes in online schools

Large-class live classes are undoubtedly one of the core business scenarios of online schools, and the calls among the modules involved are complicated. Since the winter vacation of 2012, we have been building the large screen for live-class monitoring in online schools.

  • Monitoring module breakdown, as shown in the following figure

As pictured above, a live classroom has three pillars: live streaming, messages, and interaction. Users enter the live room and pull the live stream, and messaging and interaction are high-frequency scenarios. These three core units deeply affect the in-class user experience. A problem in any of them degrades the experience and can even cause users to leave the class. The link module monitors the entire infrastructure of the live class, from the client request to the server call. Thanks to the good observability of the health map, we can quickly see which module has a problem and what the impact on upstream and downstream dependencies is. The business module mainly assists in analyzing and observing the overall situation: attendance and estimated attendance indicate the overall class trend of the day, and customer complaints directly reflect the real experience of the live class.

  • Monitoring indicator list, focusing mainly on the business North Star and stability indicators of each module

    | Module | Indicators |
    | --- | --- |
    | Live | Number of live broadcast online, Top ranking of live broadcast times, delay rate, proportion of delay, quality of live broadcast by region |
    | Message | Number of online chat users, number of messages, arrival rate, message delay, number of chat users at each end, drop rate |
    | Interactive | Popularity of on-the-spot interaction, success rate of each step of the interactive question, success rate of each end of the interactive question, version distribution of interactive clients |
    | Link | Service health map (QPS / success rate), interface QPS rankings |
    | Business | Estimated attendance, number of class buyers, number of customer complaints and feedback |

    After determining the modules, we worked closely with the business teams on each module's scenarios and technical architecture, set up several monitoring projects for multi-dimensional breakdown, and defined the business North Star and stability indicators.

  • Large screen effect (core data desensitized)

2. Renewal enrollment business

  • Module indicator breakdown, as shown in the following figure

    As shown above, Peiyou renewal enrollment is an activity run by the branch campuses: each branch opens renewal at the time specified in its renewal plan. When many branches take part and their planned times overlap, a submission peak forms and puts pressure on the system. On the business side, we display the day's schedule across the country so that users can clearly see which branches take part in renewal that day. Before renewal opens there is a live-streaming activity, so we also observe the number of live participants per campus and any live stuttering. During enrollment, we use the health map to observe the real-time pressure and quality of the whole enrollment system, the day's SLA curve, and the real-time interface QPS Top and RT Top rankings. The business module shows real-time payment peaks, success rates, and renewal counts, so that trends and quality can be seen from a business perspective.

  • Large screen effect (core data desensitized)

VI. Future Plans

We currently provide the real-time monitoring large-screen service to several business lines, including Xueersi Online School, Xueersi Peiyou, and Xueersi 1-on-1. Around core scenarios such as live classes and transactions, we have formed solution templates that let us quickly support large-screen monitoring for more business lines. Going forward, we will work in two directions:

  • Provide more complete self-service capability: building on the mature service monitoring solutions, let users configure a large screen themselves, assembling it through simple configuration and drag-and-drop.
  • Expand to more business scenarios: accumulate experience through more business practice, produce more scenario solutions, and empower similar businesses to save costs and improve efficiency.

Future Cloud – Service Monitoring is committed to providing full-link monitoring solutions. We welcome colleagues working in real-time data computing, OLAP, and data products to join us and empower future business through data!