The author

Chang Yaoguo, Tencent SRE expert, works in the PCG Big Data Platform Department and is responsible for the cloud migration, monitoring, and automation of services at the ten-million-QPS level.

Background

BeaconLogServer is the data-reporting entry point of the Beacon SDK. It receives data reported by many businesses, including Weishi, QQ, Tencent Video, QQ Browser, Yingyongbao, and others, and faces challenges such as high concurrency, large requests, and sudden traffic spikes. BeaconLogServer currently handles more than ten million QPS. Keeping the service capacity at a safe level has traditionally required a lot of manual effort. This article analyzes how moving the service onto the cloud enables zero-manpower operation and maintenance.

Hybrid cloud elastic scaling

Overall effect of elastic scaling

Let us start with automatic scale-out and scale-in. The following figure is a schematic of the elastic scaling design for the BeaconLogServer hybrid cloud.

Elastic scaling scheme

Resource management

Let us start with resource management. BeaconLogServer currently runs on more than 8,000 nodes, which requires a large amount of resources. Relying on the platform's public resource pool alone, we might not be able to scale out quickly enough when traffic surges during holidays, so we adopted a hybrid-cloud approach. In the BeaconLogServer scenario, traffic surges occur in two situations:

  • The daily service load increases slightly and lasts only for a short time

  • During the Spring Festival, the service load increases sharply and lasts for an extended period

For these two scenarios, we use three resource types, as described in the following table:

| Type | Scenario | Set |
| --- | --- | --- |
| Public resource pool | Daily business | bls.sh.1 |
| Computing-power platform | Small traffic peaks | bls.sh.2 |
| Dedicated resource pool | Spring Festival | bls.sh.3 |

For daily business we use the public resource pool together with computing-power resources. When the service load rises slightly, we scale out quickly on the computing-power platform so that the capacity watermark stays below the safety threshold. During the Spring Festival, a dedicated resource pool is created to absorb the traffic increase.

Elastic scale-out and scale-in

Resource management has been described above; when should each type of resource be scaled out or scaled in?

BeaconLogServer's daily traffic is split between the 123 platform's public resources and the computing-power platform at a ratio of 7:3. The current automatic scale-out threshold is 60%: when CPU usage exceeds 60%, the platform scales out automatically. Elastic scaling relies on the scheduling capability of the 123 platform. The specific settings are as follows:

| Type | CPU scale-in threshold (%) | CPU scale-out threshold (%) | Min replicas | Max replicas |
| --- | --- | --- | --- | --- |
| 123 platform public resource pool | 20 | 60 | 300 | 1000 |
| Computing-power platform | 40 | 50 | 300 | 1000 |
| 123 platform dedicated resource pool | 20 | 60 | 300 | 1000 |

As the table shows, the computing-power platform has a higher scale-in threshold and a lower scale-out threshold than the other pools. It mainly absorbs sudden traffic increases and its resources are recycled frequently, so it is scaled out and scaled in first. The minimum replica count is the lowest resource level needed to keep the business safe; if the replica count falls below it, the platform automatically tops it back up. The maximum replica count is set to 1000 because the IAS platform (the gateway platform) supports at most 1000 RS nodes per city.
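To make the table concrete, here is a minimal sketch, in Go, of the kind of decision a controller could derive from these thresholds. The type and function names are illustrative and the actual scheduling is done by the 123 platform; the numbers come from the computing-power platform row above.

```go
package main

import "fmt"

// PoolPolicy mirrors one row of the threshold table above. All names are
// illustrative; the real scheduling is done by the 123 platform.
type PoolPolicy struct {
	Name           string
	ScaleInCPUPct  float64 // scale in when CPU usage drops below this value
	ScaleOutCPUPct float64 // scale out when CPU usage rises above this value
	MinReplicas    int
	MaxReplicas    int // bounded by the 1000 RS nodes per city that IAS supports
}

// DesiredReplicas returns the replica count the pool should move toward,
// given the current replica count and the average CPU usage in percent.
func (p PoolPolicy) DesiredReplicas(current int, cpuPct float64) int {
	desired := current
	switch {
	case cpuPct > p.ScaleOutCPUPct:
		// Scale out proportionally so usage returns to the scale-out threshold.
		desired = int(float64(current) * cpuPct / p.ScaleOutCPUPct)
	case cpuPct < p.ScaleInCPUPct:
		// Scale in so usage rises back toward the scale-in threshold.
		desired = int(float64(current) * cpuPct / p.ScaleInCPUPct)
	}
	if desired < p.MinReplicas {
		desired = p.MinReplicas // the platform tops the pool back up to the minimum
	}
	if desired > p.MaxReplicas {
		desired = p.MaxReplicas // capped by the IAS per-city RS node limit
	}
	return desired
}

func main() {
	compute := PoolPolicy{Name: "computing-power platform",
		ScaleInCPUPct: 40, ScaleOutCPUPct: 50, MinReplicas: 300, MaxReplicas: 1000}
	fmt.Println(compute.DesiredReplicas(400, 70)) // above 50% CPU: scale out to 560
	fmt.Println(compute.DesiredReplicas(400, 20)) // below 40% CPU: clamp to the 300 minimum
}
```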

Problems and Solutions

We also ran into quite a few problems while implementing this plan; here are some of them.

1) At the access layer, the service previously used TGW. TGW has a limitation: a single service cannot have more than 200 RS nodes, while BeaconLogServer currently has more than 8,000 nodes. We investigated the IAS access layer, whose layer-4 gateway supports up to 1,000 nodes per city, which basically meets our needs. Based on this, we designed the following solution:

In general, traffic is separated by service + region. If the number of RS nodes in a city exceeds 500, the service needs to be split. For example, if a public cluster exceeds the threshold, a high-volume business such as video can be split into its own cluster. If an independent cluster exceeds the threshold, first consider adding a city and moving part of the traffic there; if no new city can be added, add another IAS cluster and distribute traffic to the different clusters by region on GSLB. The sketch below summarizes this decision.
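A minimal sketch, assuming illustrative names and the 500-node threshold mentioned above, of how this splitting decision could be expressed; in practice the decision is made by operators, not code.

```go
package main

import "fmt"

// cluster describes one IAS cluster in one city. Field names are illustrative.
type cluster struct {
	Name       string
	City       string
	RSNodes    int
	IsPublic   bool // shared public cluster vs. an independent per-business cluster
	CanAddCity bool // whether another city is available for this business
}

const rsThreshold = 500 // per-city RS node threshold mentioned above

// splitAction returns the escalation step described in the text.
func splitAction(c cluster) string {
	switch {
	case c.RSNodes <= rsThreshold:
		return "no action needed"
	case c.IsPublic:
		return "split the largest business (e.g. video) into its own cluster"
	case c.CanAddCity:
		return "add a city and move part of the traffic there"
	default:
		return "add another IAS cluster and split traffic by region on GSLB"
	}
}

func main() {
	fmt.Println(splitAction(cluster{Name: "bls-public", City: "sh", RSNodes: 620, IsPublic: true}))
	fmt.Println(splitAction(cluster{Name: "bls-video", City: "sh", RSNodes: 900}))
}
```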

2) Different resource pools in the same city are configured as different sets, so how does IAS route to all of these sets in one city? We pushed IAS to implement wildcard set matching, e.g. bls.sh.% matches bls.sh.1, bls.sh.2, and bls.sh.3. Note that the IAS wildcard differs from Polaris, which uses *. When IAS launched this feature, it found that some users were already using * as a literal character in their matching rules, so IAS chose % to represent the wildcard.
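For illustration, here is a minimal sketch of how such a %-suffix wildcard match could behave; IAS's actual matching logic is not described in detail here, so this is only an approximation.

```go
package main

import (
	"fmt"
	"strings"
)

// matchSet reports whether a set name matches a pattern in which '%' acts as a
// wildcard for the rest of the name, approximating the behaviour described above.
func matchSet(pattern, set string) bool {
	if i := strings.Index(pattern, "%"); i >= 0 {
		return strings.HasPrefix(set, pattern[:i])
	}
	return pattern == set // no wildcard: exact match
}

func main() {
	for _, s := range []string{"bls.sh.1", "bls.sh.2", "bls.sh.3", "bls.gz.1"} {
		fmt.Printf("%s -> %v\n", s, matchSet("bls.sh.%", s))
	}
}
```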

3) The difficulty in resource management is that the layer-4 IAS gateway could not use computing-power platform nodes directly. After coordination with the platform teams, IAS and the computing-power resources were connected; the solution is to use SNAT.

Considerations for this scheme

  • Only IP addresses can be bound; instances cannot be pulled in directly. When an instance is destroyed, it is not unbound automatically; you must unbind it through the console or API.

  • For large traffic volumes, evaluate in advance which gateways will be used and whether their capacity is sufficient, and perform a risk assessment.

Automatic handling of single-machine failures

Single-machine fault handling effect

The goal of automatic single-machine fault handling is zero manual maintenance. The following picture shows a screenshot of our automatic processing.

Solution for handling single-machine faults

Single-machine faults are considered mainly at the system level and the business level, as detailed below:

| Dimension | Alarm item |
| --- | --- |
| System level | CPU |
| System level | Memory |
| System level | Network |
| System level | Disk |
| Business level | ATTA Agent unavailable |
| Business level | Queue too long |
| Business level | Success rate of sending data to ATTA |

For single-machine failures, we use Prometheus together with Polaris, the company's open-source service registry. Prometheus collects metrics and fires alarms, and our code then removes the faulty nodes from Polaris.

For alarm firing and alarm recovery: when an alarm fires, we first check how many nodes are alarming. If fewer than three nodes are affected, we remove them from Polaris directly. If three or more are affected, it may be a common problem, so we send an alert that requires manual intervention. When the alarm recovers, we simply restart the node on the platform and it re-registers itself with Polaris. A sketch of this decision logic follows.
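The following is a minimal sketch of that logic, written as an Alertmanager-style webhook receiver in Go. The removeFromPolaris, restartNode, and notifyOncall helpers are hypothetical placeholders for the real platform calls, and the actual pipeline may be wired differently.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type alert struct {
	Status string            `json:"status"` // "firing" or "resolved"
	Labels map[string]string `json:"labels"` // expected to carry an "instance" label
}

type webhookPayload struct {
	Alerts []alert `json:"alerts"`
}

// Hypothetical placeholders for the real Polaris and platform operations.
func removeFromPolaris(instance string) { log.Printf("remove %s from Polaris", instance) }
func restartNode(instance string)       { log.Printf("restart %s on the platform", instance) }
func notifyOncall(instances []string)   { log.Printf("manual intervention needed: %v", instances) }

func handleAlerts(w http.ResponseWriter, r *http.Request) {
	var p webhookPayload
	if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var firing, resolved []string
	for _, a := range p.Alerts {
		if a.Status == "firing" {
			firing = append(firing, a.Labels["instance"])
		} else {
			resolved = append(resolved, a.Labels["instance"])
		}
	}
	if len(firing) > 0 && len(firing) < 3 {
		// Fewer than three alarming nodes: treat as isolated single-machine faults.
		for _, inst := range firing {
			removeFromPolaris(inst)
		}
	} else if len(firing) >= 3 {
		// Likely a common problem: escalate instead of mass-removing nodes.
		notifyOncall(firing)
	}
	// Recovered nodes are restarted and re-register themselves with Polaris.
	for _, inst := range resolved {
		restartNode(inst)
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/alerts", handleAlerts)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```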

ATTA Agent exception handling

As shown in the figure, the handling process has two paths: alarm firing and alarm recovery. When the business is abnormal, we first check the number of abnormal nodes to ensure that nodes are not removed on a large scale, and then remove the faulty node from Polaris. When the business recovers, the node is restarted.

Problems and Solutions

The main difficulties are the health check of the Prometheus agent and the constantly changing set of BeaconLogServer nodes. The first is handled mainly by the platform side. For the second, we run a scheduled script that pulls the node list from Polaris and hot-loads it into Prometheus, as sketched below.
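A minimal sketch of that scheduled sync, assuming a hypothetical fetchPolarisInstances helper in place of the real Polaris query and an illustrative service name and file path: it writes the node list in the Prometheus file_sd JSON format, which Prometheus reloads on its own.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"
)

// targetGroup follows the Prometheus file_sd JSON format.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels,omitempty"`
}

// fetchPolarisInstances stands in for a real Polaris query that lists healthy
// instances of a service as "ip:port" strings.
func fetchPolarisInstances(service string) ([]string, error) {
	return []string{"10.0.0.1:9100", "10.0.0.2:9100"}, nil // placeholder data
}

func syncOnce(service, path string) error {
	targets, err := fetchPolarisInstances(service)
	if err != nil {
		return err
	}
	groups := []targetGroup{{Targets: targets, Labels: map[string]string{"service": service}}}
	data, err := json.MarshalIndent(groups, "", "  ")
	if err != nil {
		return err
	}
	// Write atomically so Prometheus never reads a half-written file.
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

func main() {
	for {
		if err := syncOnce("BeaconLogServer", "/etc/prometheus/targets/bls.json"); err != nil {
			log.Printf("sync failed: %v", err)
		}
		time.Sleep(time.Minute) // run on a fixed schedule, like the timed script
	}
}
```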

Conclusion

Moving to the cloud effectively solved the two problems of automatic scaling and single-machine failure handling; it reduces manual operations, lowers the risk of human error, and improves the stability of the system. From the migration we also summarize a few points:

  • Migration plan: before moving to the cloud, survey the migration plan thoroughly, especially the features supported by the target system, to reduce the systemic risk of running into unsupported features during the migration.

  • Migration process: set up good monitoring of key metrics. After migrating traffic, watch the metrics closely and roll back promptly if problems occur.
