
Author: Wu Shusheng, senior engineer at Tencent, responsible for building the SNG big data monitoring platform. He has nearly ten years of experience in monitoring system development and has built large-scale, highly available distributed monitoring systems on big data platforms.

Introduction: The SNG full-link log monitoring platform currently stores 10 TB of data per day, achieves a 1/10 compression ratio, and handles peak traffic of 30 GB/s. This article looks at the technical difficulties encountered in building such a platform and how they were resolved.

Background

In microservice and distributed environments, full-link log monitoring can markedly improve the efficiency of problem locating and analysis, and has become a standard tool for development and operations. Open source solutions and mature vendors already exist in this space.

For example, Twitter designed and developed Zipkin, a distributed tracing system based on Google's Dapper paper, which collects logs and latency information across processing nodes to help users troubleshoot abnormal request links.

Zipkin is easy to adopt for business units that share a unified RPC middleware framework. However, the Tencent SNG full-link log monitoring platform (hereafter "Full-link") faces more complex real-world business scenarios, so implementing full-link log monitoring ran into more challenges, and its technology selection evolved from open source components to in-house development.

The SNG full-link log monitoring platform has been connected to log data from Qzone and video cloud services. It stores 10 TB per day, achieves a 1/10 compression ratio, and handles peak traffic of 30 GB/s.

Let’s share a case scenario:

On August 31, 2017, a metric of service module X was abnormal from 21:40 to 21:50, with the success rate dropping from 99.988% to 97.325%, as shown below:

After receiving the success-rate alarm, drilling down in the multidimensional monitoring system showed that the success rate of the Qzone VoD service on the iPhone client had dropped, with return code -310110004, as shown below:

Multidimensional analysis of the overall dashboard traced the anomaly to an APP-side problem, so the context of the affected users needed further analysis. That means viewing the full-link log data of an abnormal user at the time of the anomaly. In the full-link view, you can see the queried user's logs and the operation flow that matches the abnormal condition.

The above is a case of anomaly analysis that drills down from the aggregate view to an individual user.

Usage scenarios

There are three types of full-link log monitoring scenarios:

  • The first is case analysis, mainly for handling user complaints and for drilling down from aggregate anomalies to individual users;
  • The second is development and debugging, mainly used to check the logs of related modules and to provide clues when filing test tickets during development;
  • The third is alarm monitoring, which extracts dimensions from log data and aggregates them into multidimensional statistics for anomaly detection and root-cause analysis.

Challenges encountered

While building the full-link log monitoring platform, the monitoring modules went through a transformation from traditional monitoring and quality statistics to a big data multidimensional monitoring platform. Along the way we also ran into pitfalls with big data suites in our business scenarios.

Business Diversity Challenges

The QQ ecosystem covers a variety of services, such as Mobile QQ, Qzone, live streaming, video on demand, and membership. These businesses produce logs in different formats, and there is no unified RPC middleware framework. This means the system must support flexible log formats and multiple collection methods.

Massive Data Challenge

At the same time, status data reported by more than 200 million online users amounts to over 10 TB of storage per day and more than 30 GB/s of bandwidth. This demands stable, efficient data processing and high-performance, low-cost data storage. After prototyping with open source components, we gradually ran into performance bottlenecks and stability problems, which drove us to replace the open source components step by step with in-house development.

Addressing the challenges

Diversity of logs:

The value of logs lies not only in queries but also in statistical analysis and alarm detection. To this end, log data is normalized and distributed to the multidimensional monitoring platform, reusing that platform's existing capabilities.

Drawing on the experience accumulated in monitoring platform development, the full-link log monitoring platform is designed to combine the strengths of both approaches: the cost, performance, and stability bottlenecks of open source storage components are solved by developing our own log storage platform.

Our full-link log monitoring platform supports four data formats: delimited, regex-parsed, JSON, and API reporting:

The delimited, regex, and JSON formats allow flexible, non-invasive data collection, but server-side log parsing is slow: delimiter parsing reaches only about 40,000 records/s, while the API method reaches about 100,000 records/s. For internal services, we therefore recommend using a unified logging component and embedding the API to report data.
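As a rough illustration, below is a minimal Python sketch of how the three non-invasive formats might be normalized into a common record. The field names, sample lines, and configurations are hypothetical and are not the platform's actual schema or parser.

```python
import json
import re

# Hypothetical sketch: normalize the three non-invasive log formats
# (delimited, regex, JSON) into one dict-shaped record. Field names and
# sample configs are illustrative only.

def parse_delimited(line: str, delimiter: str, fields: list[str]) -> dict:
    """Split a delimited log line and zip it with the configured field names."""
    return dict(zip(fields, line.strip().split(delimiter)))

def parse_regex(line: str, pattern: re.Pattern) -> dict:
    """Extract named groups from a log line with a precompiled regex."""
    match = pattern.match(line)
    return match.groupdict() if match else {}

def parse_json(line: str) -> dict:
    """JSON-formatted logs are already structured; just decode them."""
    return json.loads(line)

if __name__ == "__main__":
    fields = ["timestamp", "user_id", "module", "ret_code"]
    print(parse_delimited("1504186800|10001|vod|-310110004", "|", fields))

    pattern = re.compile(r"(?P<timestamp>\d+) user=(?P<user_id>\d+) ret=(?P<ret_code>-?\d+)")
    print(parse_regex("1504186800 user=10001 ret=-310110004", pattern))

    print(parse_json('{"timestamp": 1504186800, "user_id": 10001, "ret_code": -310110004}'))
```

API reporting skips this server-side parsing entirely, which is why it reaches higher throughput.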

Automatic disaster recovery and capacity expansion

For a massive log monitoring system, stateless module design is the first step toward automatic disaster recovery and capacity expansion. For example, the access, parsing, and processing modules need no state synchronization and can each be deployed independently to provide service.

However, such stateless service modules still need a mechanism for removing abnormal links: if a node in the middle of a data processing chain fails, the services provided by the nodes downstream of it become useless, and the current link must be cut off. There are various ways to remove abnormal links, for example a ZooKeeper heartbeat mechanism.

To avoid depending on too many components, we implemented a stateful heartbeat mechanism. Upstream node A sends a heartbeat probe to downstream node B every 6 seconds, and B replies with both its own service availability and its link status. If A learns from B's heartbeat that B is unavailable, and no other node downstream of A is available, then A marks its own downstream link as unavailable as well. The heartbeat status propagates hop by hop, and the whole link is disabled automatically.

Stateful services are typically storage services. They implement disaster recovery with an active/standby mechanism; when only one master may serve at a time, a ZooKeeper election can be used to perform the active/standby switchover.
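Returning to the stateless modules, here is a minimal Python sketch of the stateful heartbeat idea: each node answers a probe with its own availability plus the state of its downstream link, so a failure propagates hop by hop. The classes and node names are illustrative assumptions; the real system also runs the probes over the network on a 6-second timer.

```python
from dataclasses import dataclass, field

# Sketch of "heartbeat with state": a node is usable only if it is itself
# available AND its own downstream link is up, so unavailability propagates
# hop by hop toward the upstream nodes.

@dataclass
class Node:
    name: str
    available: bool = True                        # local service availability
    downstream: list["Node"] = field(default_factory=list)

    def heartbeat(self) -> bool:
        """Answer an upstream probe: am I usable as part of the link?"""
        return self.available and self.link_up()

    def link_up(self) -> bool:
        """The link is up if there is no downstream tier, or at least one
        downstream node answers its heartbeat positively."""
        if not self.downstream:
            return True
        return any(node.heartbeat() for node in self.downstream)

if __name__ == "__main__":
    store = Node("storage", available=False)           # storage tier goes down
    processor = Node("processor", downstream=[store])
    access = Node("access", downstream=[processor])
    print(access.link_up())   # False: the access layer sees the whole link as down
```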

The second step toward automatic disaster recovery and capacity expansion is to provide a name service and load balancing through a routing mechanism. The open source component ZooKeeper makes it quick to implement the name service: registration logic sits on the server side, while routing and load-balancing logic sits on the client side.
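For illustration, here is a hedged sketch of such a name service with client-side load balancing on ZooKeeper, using the open source kazoo client; the znode path and payload format are assumptions, not the platform's actual layout.

```python
import json
import random

from kazoo.client import KazooClient   # open source ZooKeeper client

SERVICE_PATH = "/fulllink/log-access"   # hypothetical service namespace

def register(zk: KazooClient, host: str, port: int) -> None:
    """Server side: publish an ephemeral node that disappears if the process dies."""
    zk.ensure_path(SERVICE_PATH)
    payload = json.dumps({"host": host, "port": port}).encode()
    zk.create(f"{SERVICE_PATH}/instance-", payload, ephemeral=True, sequence=True)

def pick_instance(zk: KazooClient) -> dict:
    """Client side: list live instances and pick one at random (load balancing)."""
    children = zk.get_children(SERVICE_PATH)
    if not children:
        raise RuntimeError("no live instances registered")
    data, _stat = zk.get(f"{SERVICE_PATH}/{random.choice(children)}")
    return json.loads(data)

if __name__ == "__main__":
    zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ZooKeeper
    zk.start()
    register(zk, "10.0.0.12", 8080)
    print(pick_instance(zk))
    zk.stop()
```

With ephemeral nodes, a crashed server disappears from the instance list automatically, which keeps the client-side routing table self-healing.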

Data channel disaster recovery

We use two mechanisms: dual write and message queuing.

  • For monitoring data with high data-quality requirements, dual write is used. This approach requires the backend to have enough resources to handle peak requests; in return it offers low-latency, efficient data processing.
  • Log data uses a message queue to provide data disaster recovery. The solutions we adopted are Kafka and RabbitMQ + MongoDB.

A message queue can absorb high-throughput log data and smooth out traffic peaks. The side effect is that data latency grows during peak hours, which cannot satisfy real-time alarm monitoring. Message queues also have to guard against backlogs caused by queue failures.

For example, in a Kafka cluster, once the message backlog exceeds disk capacity, the throughput of the whole queue drops and data quality suffers.

We therefore moved to RabbitMQ + MongoDB. At the access layer, data is grouped into blocks of 10,000 records or 30 seconds. Each block is written to a randomly chosen instance in a cluster of multiple MongoDB nodes, and the MongoDB IP and block key are written to RabbitMQ. The back-end processing cluster consumes these messages from RabbitMQ, reads the data from the corresponding MongoDB node, and then deletes it. The system periodically checks the message backlog in RabbitMQ and MongoDB and automatically clears messages once the backlog exceeds a threshold.
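Here is a rough sketch of the access-layer side of this channel, assuming the pika (RabbitMQ) and pymongo (MongoDB) clients; the hosts, queue name, and block-key field are illustrative, and connection handling is simplified.

```python
import json
import random
import time
import uuid

import pika                      # RabbitMQ client
from pymongo import MongoClient  # MongoDB client

MONGO_HOSTS = ["10.0.0.21:27017", "10.0.0.22:27017"]   # hypothetical cluster
QUEUE = "fulllink-log-blocks"                           # hypothetical queue name

class BlockWriter:
    """Buffer logs into a block (10,000 records or 30 seconds), write the block
    to a randomly chosen MongoDB instance, then tell RabbitMQ where it is."""

    def __init__(self, max_records: int = 10_000, max_age_s: int = 30):
        self.buffer: list[dict] = []
        self.started = time.time()
        self.max_records = max_records
        self.max_age_s = max_age_s
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        self.channel = conn.channel()
        self.channel.queue_declare(queue=QUEUE)

    def append(self, record: dict) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records or time.time() - self.started >= self.max_age_s:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        mongo_host = random.choice(MONGO_HOSTS)    # spread writes across instances
        block_key = uuid.uuid4().hex               # key shared by the block and the message
        docs = [dict(r, block_key=block_key) for r in self.buffer]
        # A production system would pool connections instead of reconnecting per flush.
        MongoClient(f"mongodb://{mongo_host}")["fulllink"]["blocks"].insert_many(docs)
        # Tell the processing cluster where to fetch (and later delete) the block.
        self.channel.basic_publish(
            exchange="",
            routing_key=QUEUE,
            body=json.dumps({"mongo_host": mongo_host, "block_key": block_key}),
        )
        self.buffer, self.started = [], time.time()
```

The consumer side mirrors this: read a message, fetch the block from the named MongoDB node by its key, process it, and then delete the block.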

Query

To support efficient, low-cost queries, the log storage layer is self-developed. Data reported across the full link is sharded by a hash of the user ID or request ID, which serves as the primary key. Each shard is accumulated in a cache module for 1 minute or until it reaches about 1 MB, then written to the file server cluster. Once a file has been written, the mapping from the hash value to the file path is written to ElasticSearch.
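A minimal sketch of this write path follows, with an in-memory dict standing in for the ElasticSearch index; the shard count, flush thresholds, and file paths are assumptions.

```python
import hashlib
import time
from collections import defaultdict
from pathlib import Path

SHARDS = 64                       # assumed shard count
FLUSH_SECONDS = 60                # flush after 1 minute...
FLUSH_BYTES = 1 << 20             # ...or roughly 1 MB, whichever comes first

path_index: dict[int, list[str]] = defaultdict(list)   # stand-in for ElasticSearch
buffers: dict[int, list[bytes]] = defaultdict(list)
buffer_started: dict[int, float] = {}

def shard_of(key: str) -> int:
    """Hash the user ID or request ID to pick a shard."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % SHARDS

def write(key: str, log_line: str) -> None:
    """Buffer a log line in its shard and flush on size or age."""
    shard = shard_of(key)
    buffers[shard].append(log_line.encode())
    buffer_started.setdefault(shard, time.time())
    size = sum(len(b) for b in buffers[shard])
    if size >= FLUSH_BYTES or time.time() - buffer_started[shard] >= FLUSH_SECONDS:
        flush(shard)

def flush(shard: int) -> None:
    """Write the shard's buffer to a file and record the path mapping."""
    path = Path(f"fulllink-data/shard-{shard}-{int(time.time())}.log")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(b"\n".join(buffers[shard]))
    path_index[shard].append(str(path))    # the real system writes this mapping to ES
    buffers[shard].clear()
    buffer_started.pop(shard, None)
```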

Querying data provides two types of capabilities:

The first is query by primary key: compute the hash of the query key, retrieve the corresponding file paths from ES, and hand them to the query module for filtering and searching.

The second is keyword search on fields other than the primary key. Given the business scenarios, it is enough to return some logs containing the keyword rather than all of them, so the strategy trades completeness for query performance and avoids full-text retrieval: the first pass searches 1,000 files and stops if results are found; otherwise the search window grows incrementally (2,000 files, and so on) until 100,000 files have been searched.
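Here is a hedged sketch of the two query paths. The local file reads stand in for the ES lookup and the file-server cluster, and the exact growth schedule between 1,000 and 100,000 files is an assumption.

```python
from pathlib import Path

def read_lines(path: str) -> list[str]:
    """Stand-in for fetching a log file from the file-server cluster."""
    return Path(path).read_text().splitlines()

def query_by_key(key: str, paths: list[str]) -> list[str]:
    """Primary-key query: the caller has already hashed the key and resolved
    the hash to file paths via ES; here we only filter those files."""
    return [line for p in paths for line in read_lines(p) if key in line]

def query_by_keyword(keyword: str, candidate_paths: list[str]) -> list[str]:
    """Keyword query: scan a growing window of files, stop at the first window
    that yields results, and give up after 100,000 files."""
    results: list[str] = []
    scanned = 0
    for limit in (1_000, 2_000, 5_000, 10_000, 50_000, 100_000):
        for p in candidate_paths[scanned:limit]:
            results.extend(line for line in read_lines(p) if keyword in line)
        scanned = min(limit, len(candidate_paths))
        if results or scanned >= len(candidate_paths):
            break
    return results
```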

To meet a variety of business scenarios, we abstracted the ETL capability of the data processing module into a plug-in, configuration-driven design, and also provided unified task management and cluster management.
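As an illustration of what such a plug-in, configurable ETL abstraction might look like, here is a small sketch; the step interface and the example steps are hypothetical, not the platform's real classes.

```python
from abc import ABC, abstractmethod

class EtlStep(ABC):
    """One pluggable processing step; a pipeline is a configured list of steps."""

    @abstractmethod
    def process(self, record: dict) -> dict | None:
        """Return the transformed record, or None to drop it."""

class FieldFilter(EtlStep):
    """Keep only the configured fields (dimension extraction)."""
    def __init__(self, keep: list[str]):
        self.keep = keep

    def process(self, record: dict) -> dict | None:
        return {k: v for k, v in record.items() if k in self.keep}

class ErrorOnly(EtlStep):
    """Drop records whose return code indicates success."""
    def process(self, record: dict) -> dict | None:
        return record if record.get("ret_code", 0) != 0 else None

def run_pipeline(record: dict, steps: list[EtlStep]) -> dict | None:
    """Push a record through the configured steps; stop once a step drops it."""
    for step in steps:
        record = step.process(record)
        if record is None:
            return None
    return record

if __name__ == "__main__":
    steps = [FieldFilter(["user_id", "module", "ret_code"]), ErrorOnly()]
    print(run_pipeline(
        {"user_id": 10001, "module": "vod", "ret_code": -310110004, "extra": 1},
        steps,
    ))
```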

Conclusion

The following experience from developing full-link log monitoring may serve as a reference:

  • Build the primary business functions with mature open source components;
  • During operation, improve the system's processing capacity and stability by modifying open source components or developing in-house replacements, reducing operating costs and improving O&M efficiency;
  • Use stateless design and routing/load-balancing capabilities to standardize modules;
  • Abstract and refine functional models into platform capabilities that meet diverse business needs.

PS: We are hiring; interested candidates are welcome to contact us.

[Job] Operations development

[Location] Shenzhen

[Job Responsibilities]

  • Build the cloud monitoring system: research leading monitoring solutions in the industry, understand user needs in depth, and plan and implement upgrades and evolution of the monitoring system.
  • Design and build the massive-data storage architecture for cloud monitoring: master the current storage architecture, research and compare leading storage solutions in the industry, and optimize and upgrade the existing architecture.
  • Design and build the massive-data processing architecture for cloud monitoring: master the current data processing framework, research and compare leading frameworks in the industry, and optimize and upgrade the existing framework.
  • Develop cloud monitoring features and build industry-leading monitoring solutions.

[Requirements]

  • Bachelor's degree or above in a computer-related field; 3 years of software development experience preferred.
  • Deep understanding of Linux; familiar with TCP/IP networking principles and how the HTTP protocol works.
  • Proficient in at least one of C/C++, Java, or Python; solid understanding of agile development with practical experience.
  • Familiar with network programming frameworks, open source message queues, and databases such as MySQL and Redis.
  • Experience in big data system development preferred, e.g. in-depth knowledge of JStorm, Hadoop, ZooKeeper, Druid, and other big data suites.
  • Strong skills in technical research, logical thinking, communication, and execution; good team spirit and enthusiasm for learning.

If you are interested in this position, please send your resume to [email protected].

