In this article, we present our stability-related specifications. They are distilled from reviews of a large number of real online failures in the five years since Hitch was founded, and we hope they can help improve the stability of your services.

The server side is the largest engineering team in the Hitch Technology Department, and as the team grows and the business iterates it plays an increasingly important role. On the one hand, standardization improves the quality and efficiency of our delivery; on the other hand, we want to keep distilling, from one real-world incident after another, the best practices that fit our team.

With this in mind, we developed and rolled out a minimal set of executable engineering specifications for server-side development, covering the R&D process, stability, performance and cost, among other aspects.

This article presents the stability-related part of those specifications, distilled from post-mortems of a large number of real online failures over the five years since Hitch was founded; we hope they can help improve the stability of everyone's services.

1. Terminology

There are many technical terms that will be used in the description below. For easier understanding, here is a brief explanation of these terms:

  • Service classification: based on business impact, we classify services into levels. Services that affect core business indicators (such as order volume) are level-1 services, and any problem in them must be handled with first priority; all other services are level-2.
  • Preview cluster: a deployment identical to the online production environment, except that it receives no production traffic; it is accessed only internally, and traffic is closed within the cluster.
  • Small traffic cluster: a deployment identical to the online production environment; through traffic routing, only the traffic of a few cities lands on this cluster, and traffic is closed within the cluster.
  • Grayscale release: a release proceeds through the preview cluster, the grayscale (small traffic) city cluster, then 10%, 50% and finally 100% of traffic, to keep the rollout safe.
  • Full-link stress test: a way of stress-testing the production environment without affecting online services, used to find the capacity and bottlenecks of the production environment.
  • Multi-active data centers: services are deployed in multiple data centers so that when one fails, traffic can be switched quickly to the others to limit the loss. It is a complete solution involving traffic routing, traffic closed loops, data synchronization, data consistency, disaster response and many other parts.

2. Stability specifications

Stability design

  • [Mandatory] The caller must set a timeout, and timeouts along a call chain must decrease from top to bottom (a minimal sketch follows this list);
  • [Mandatory] New dependencies added to the core flow are weak dependencies by default; introducing a new strong dependency requires review and sign-off;
  • [Mandatory] If a downstream service is registered in service discovery, it must be accessed through service discovery, so that downstream nodes and timeouts can be managed centrally;
  • [Mandatory] All internal services must be accessed through service discovery; for external services, push for service-discovery access wherever possible;
  • [Recommended] The framework should support manual, one-click circuit breaking of dependent services;
  • [Recommended] Prefer stateless designs;
  • [Recommended] Design write interfaces to be safe against duplicate submission (idempotent);
  • [Recommended] Keep system designs simple and reliable, and prefer mature technology;
  • [Recommended] Set reasonable rate limits on interfaces: mandatory for core services, recommended for the rest.
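
As a minimal illustration of the decreasing-timeout rule, here is a Go sketch (an assumption about the stack, not the framework's actual API) that derives a downstream call's timeout from the remaining budget of the incoming request; `callPricing` is a hypothetical downstream call:

```go
// Minimal sketch of top-down decreasing timeouts in a context-propagating
// Go service.
package stability

import (
	"context"
	"time"
)

func handleRequest(ctx context.Context) error {
	// The entry point owns the overall request budget, e.g. 800ms.
	ctx, cancel := context.WithTimeout(ctx, 800*time.Millisecond)
	defer cancel()

	// The downstream call gets the smaller of its own cap (300ms) and the
	// remaining budget, so a child timeout never exceeds its parent's.
	budget := 300 * time.Millisecond
	if deadline, ok := ctx.Deadline(); ok {
		if remaining := time.Until(deadline); remaining < budget {
			budget = remaining
		}
	}
	callCtx, cancelCall := context.WithTimeout(ctx, budget)
	defer cancelCall()

	return callPricing(callCtx)
}

// callPricing stands in for any downstream RPC that honors ctx's deadline.
func callPricing(ctx context.Context) error { return nil }
```

Go's context already enforces the parent deadline on children; computing the budget explicitly just makes the top-down rule visible.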

Deployment and Operations

  1. [Mandatory] Temporary scripts must not operate on online data directly, bypassing interfaces or encapsulated methods; when such a script is truly necessary, it must be tested by QA;
  2. [Mandatory] Releases must go through the release platform and be connected to the quality platform (automated cases, core dashboards and other release checklists), with a mandatory observation period after each release;
  3. [Mandatory] Level-1 services must have a preview cluster, a small traffic cluster (except for a few special services) and dual data-center deployment;
  4. [Recommended] Non-level-1 online services should also have a preview cluster;
  5. [Recommended] Do capacity planning when launching a new service, and verify module capacity through interface stress tests or full-link stress tests.

Monitoring and alerting

  1. [Mandatory] Machines running online services must have basic monitoring and alerting, covering CPU, IO, memory, disk, coredump and ports;
  2. [Mandatory] Online services must have basic service monitoring, including interface QPS, fatal error counts and latency;
  3. [Recommended] Core business indicators (order volume, payment volume, etc.) should be monitored and alerted on;
  4. [Recommended] Maintain an overall service dashboard covering the core modules of the area, so that problems can be located quickly.

Change management

  1. [Mandatory] Every level-1 service change must follow the grayscale release mechanism;
  2. [Mandatory] Every level-1 change, whether a service change or a configuration change, must have a rollback plan so it can be rolled back quickly if the change misbehaves;
  3. [Recommended] Avoid letting unrelated code "hitchhike" into a release;
  4. [Recommended] When rolling back a service, roll back the corresponding code and configuration at the same time to keep the mainline correct;
  5. [Recommended] For configuration changes, especially complex ones, add a corresponding configuration verification mechanism.

Plan management

  1. [Mandatory] There must be a multi-active traffic-switching plan and it must be kept effective; organize regular drills, ideally once a month;
  2. [Mandatory] The full-link stress-test channel must be kept effective, with stress tests organized regularly;
  3. [Mandatory] The one-click rate-limiting plan must be kept effective, and reviewed and rehearsed regularly;
  4. [Mandatory] The degradation plan must be kept effective and rehearsed regularly.

Fault handling principles

  1. [Mandatory] When an online fault occurs, it must be handled as the first priority;
  2. [Mandatory] When an online fault occurs and a change has recently been made, roll the change back immediately;
  3. [Mandatory] Every online fault must be followed by a post-mortem review;
  4. [Mandatory] There must be a post-mortem specification, and post-mortems must be conducted according to it.

3. Stability anti-patterns

This chapter is driven by concrete examples drawn from a large number of online fault cases. From the stability mistakes that are easy to make at each stage of the R&D process, we extract a set of anti-patterns for reference, so that the same problems can be avoided in future work and the stability of online services improved.

3.1. Disaster tolerance and fault tolerance design

**Anti-pattern 3.1.1 Excessive node circuit breaking**

[Example] To improve the request success rate, downstream nodes are fused when they fail: if a node returns 5 consecutive errors within 1 minute, it is fused and no longer called (and restored after a recovery window). One day the network jitters and 3 of the downstream service's 4 instances enter circuit-breaking mode; all traffic to the downstream lands on the single remaining instance and overwhelms it. The downstream service avalanches and the whole system becomes unavailable.

[Solution] Circuit breaking itself needs a protective cap, so that fusing too many nodes at once does not create a new stability problem (a sketch follows).
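
As one possible mitigation (our assumption, not necessarily the exact mechanism used in production), the breaker can refuse to trip once a configured fraction of nodes is already open:

```go
// Sketch: a node-level circuit-breaker pool that refuses to open a breaker
// when too many peers are already fused, so network jitter cannot take most
// of the cluster out of rotation at once.
package breaker

import "sync"

type Pool struct {
	mu           sync.Mutex
	open         map[string]bool // nodes currently fused
	total        int             // total number of downstream nodes
	maxOpenRatio float64         // e.g. 0.5: never fuse more than half the nodes
}

func NewPool(total int, maxOpenRatio float64) *Pool {
	return &Pool{open: make(map[string]bool), total: total, maxOpenRatio: maxOpenRatio}
}

// TryOpen is called after a node exceeds its error threshold. It fuses the
// node only if enough healthy nodes remain to carry the traffic.
func (p *Pool) TryOpen(node string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if float64(len(p.open)+1) > p.maxOpenRatio*float64(p.total) {
		return false // refuse: opening this breaker would fuse too many nodes
	}
	p.open[node] = true
	return true
}

// Close restores a node after its recovery window has passed.
func (p *Pool) Close(node string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.open, node)
}
```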

**Anti-pattern 3.1.2 Fixed retry sequence**

[Example] The retry target for every query is fixed as "the next node" in a static order.

This carries two risks. One is an avalanche: if the retry sequence for a class of queries is fixed as A then B, then when A fails, B takes twice its normal pressure; if B is crushed as well, the failure cascades on down the sequence. The other: with a retry count of 2, if A and B happen to be restarted at the same time, every query whose retry sequence is A, B is guaranteed to get no result.

[Solution] Evaluate alternative retry algorithms, such as random retry. Note that, compared with a fixed retry sequence, random retry brings its own risks; for example, it may lower the cache hit ratio of the downstream module and reduce system performance. A minimal sketch follows.
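
A minimal sketch of random retry, assuming the caller has the replica list; `do` stands in for whatever RPC call is actually used:

```go
// Sketch: retry on a randomly chosen, not-yet-tried replica instead of a
// fixed "next node", so retried traffic spreads across the cluster.
package retry

import (
	"errors"
	"math/rand"
)

func callWithRandomRetry(replicas []string, attempts int, do func(addr string) error) error {
	if len(replicas) == 0 {
		return errors.New("no replicas available")
	}
	lastErr := errors.New("no attempt made")
	tried := make(map[int]bool)
	for i := 0; i < attempts && len(tried) < len(replicas); i++ {
		// Pick a random replica; skip forward past ones we already tried.
		idx := rand.Intn(len(replicas))
		for tried[idx] {
			idx = (idx + 1) % len(replicas)
		}
		tried[idx] = true
		if lastErr = do(replicas[idx]); lastErr == nil {
			return nil
		}
	}
	return lastErr
}
```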

**Anti-pattern 3.1.3 Unreasonable timeout settings**

[Example] The upstream service's timeout is set unreasonably, so when the downstream has problems the upstream is dragged down with it.

[Solution] Set timeouts according to the link's 99th-percentile (p99) latency, and review the link's communication-related configuration periodically.

**Anti-pattern 3.1.4 Ignoring the downstream impact of multiple calls within one request**

[Example] The timeout for a single call to a downstream service is set properly, but the same request calls that downstream serially many times, so when the downstream fails the upstream service is still dragged down.

[Solution] Besides the timeout on a single downstream call, set an overall time budget for all calls to that downstream within one request (see the sketch below).
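
A sketch of such an overall budget for serial calls within one request, assuming a context-propagating Go service; `fetch` is a hypothetical per-key downstream call:

```go
// Sketch: bound the *total* time spent on N serial calls to one downstream,
// not just each individual call, so N near-timeout calls cannot add up and
// drag the upstream down.
package budget

import (
	"context"
	"time"
)

func fetchAll(ctx context.Context, keys []string,
	fetch func(ctx context.Context, key string) error) error {
	// Overall budget for the whole loop, regardless of len(keys).
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	for _, k := range keys {
		// Stop early once the overall budget is exhausted.
		if err := ctx.Err(); err != nil {
			return err
		}
		// Each call keeps its own cap but can never outlive the overall budget.
		callCtx, cancelCall := context.WithTimeout(ctx, 100*time.Millisecond)
		err := fetch(callCtx, k)
		cancelCall()
		if err != nil {
			return err
		}
	}
	return nil
}
```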

**Anti-pattern 3.1.5 Unreasonable retry logic**

[Example] Retries exist at several places along the link, so when the downstream fails the retries are amplified layer by layer and cause a service avalanche.

[Solution] Evaluate the retry mechanism, comb through the entire request-processing link, and make sure retries converge at a single place.

**Anti-pattern 3.1.6 Ignoring the impact of business traffic spikes**

[Example] A business line has request spikes on the hour and half-hour. During a holiday, the traffic hitting a downstream service surged at those moments and the downstream avalanched.

[Solution] Smooth out business spikes to reduce the peak impact on downstream services (one jitter-based sketch follows).
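
Where the spikes come from synchronized timers or scheduled jobs (an assumption; user-driven spikes need traffic shaping instead), one simple smoothing technique is to add random jitter before each run:

```go
// Sketch: delay each run of a periodic task by a random offset within a
// jitter window, so many instances do not all hit the downstream exactly on
// the hour or half-hour.
package jitter

import (
	"math/rand"
	"time"
)

// runWithJitter fires task roughly every interval, offset by a random delay
// of up to window (window must be > 0).
func runWithJitter(interval, window time.Duration, task func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		time.Sleep(time.Duration(rand.Int63n(int64(window))))
		task()
	}
}
```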

**Anti-pattern 3.1.7 No fault tolerance for abnormal input**

[Example] The business does not handle abnormal input defensively and processes it with the normal logic, corrupting the data.

[Solution] Business code, especially at business entry points, must tolerate unreasonable and abnormal input to keep the whole system robust.

**Anti-pattern 3.1.8 Interfaces not designed to be idempotent**

[Example] An interface is not idempotent, and a network failure triggers a large number of retries, producing a large amount of incorrect core data.

[Solution] Design interfaces with idempotency in mind (see the sketch below).
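
A minimal in-memory sketch of an idempotency key, purely illustrative: in practice the deduplication store would be a database unique index or a Redis SETNX rather than a map, which also closes the race between concurrent replays that this version leaves open.

```go
// Sketch: idempotent writes keyed by a client-supplied idempotency key, so
// network-level retries return the original result instead of creating a
// duplicate record.
package idempotent

import "sync"

type OrderService struct {
	mu        sync.Mutex
	processed map[string]string // idempotency key -> order ID
}

func NewOrderService() *OrderService {
	return &OrderService{processed: make(map[string]string)}
}

// CreateOrder runs create at most once per key; a replayed key returns the
// previously created order ID.
func (s *OrderService) CreateOrder(key string, create func() (string, error)) (string, error) {
	s.mu.Lock()
	if id, ok := s.processed[key]; ok {
		s.mu.Unlock()
		return id, nil // duplicate request: return the original result
	}
	s.mu.Unlock()

	id, err := create()
	if err != nil {
		return "", err
	}

	s.mu.Lock()
	s.processed[key] = id
	s.mu.Unlock()
	return id, nil
}
```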

**Anti-pattern 3.1.9 Non-core flows not made weak dependencies**

[Example] Dependencies in the flow are not made weak, so the system as a whole is fragile: a failure in any single dependency brings down the entire business.

[Solution] Comb through the system's flows regularly, keep the system as small as possible, and turn every non-essential dependency into a weak one.

**Anti-pattern 3.1.10 Ignoring the risk of ID overflow**

[Example] An ID grew past the range of int32 and overflowed, causing a service failure.

[Solution] When adding resource-related ID fields, consider the range of the ID and whether it can overflow, and review these fields regularly so the risk is caught ahead of time and failures are prevented as far as possible (a checking sketch follows).
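
A small sketch of the kind of periodic headroom check this implies; the warning threshold and the way it is wired into a job are assumptions:

```go
// Sketch: warn well before an ID column reaches the int32 limit
// (2,147,483,647), turning a future outage into an early migration task.
package idcheck

import (
	"fmt"
	"math"
)

// checkHeadroom returns an error once the current maximum ID has consumed
// more than warnRatio of the int32 range (e.g. warnRatio = 0.7).
func checkHeadroom(currentMaxID int64, warnRatio float64) error {
	limit := int64(math.MaxInt32)
	if float64(currentMaxID) >= warnRatio*float64(limit) {
		return fmt.Errorf("id %d has used %.0f%% of the int32 range; plan a migration to int64",
			currentMaxID, 100*float64(currentMaxID)/float64(limit))
	}
	return nil
}
```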

3.2. Deployment and operation

**Anti-pattern 3.2.1 Deploying without considering network topology**

[Example] Network segment factors were not considered at deployment time: multiple instances of a service ended up under a single switch, so when that switch failed, those instances became unavailable at once and the cluster hosting the service avalanched.

[Solution] Take physical placement into account when deploying: spread instances of the same service across different data centers, switches and network segments as far as possible.

**Anti-pattern 3.2.2 No resource isolation between co-located services**

[Example] Several services are co-located on the same machines; one of them consumes too much CPU and the others start failing.

[Solution] When services are co-located, resource isolation is mandatory, so that one service hogging resources cannot make the other services on the same machine unavailable.

**Anti-pattern 3.2.3 Core business not isolated and protected**

[Example] A bug in a non-core flow writes a huge number of messages to MQ, making the whole MQ cluster unusable and bringing the entire business down.

[Solution] Use separate MQ deployments for core and non-core links, so that changes in non-core flows cannot affect the main flow and the stability of the core flows and the business is protected.

3.3. Capacity management

**Anti-pattern 3.3.1 Capacity planning that ignores failures**

[Example] An online service has low QPS and only two instances are deployed. When one instance has a problem at peak time, all the traffic lands on the other instance and crushes the service.

[Solution] Take disaster tolerance into account when estimating capacity and keep a buffer. If deploying more instances feels like a waste of machines, consider an elastic cloud, which is more flexible (see the arithmetic sketch below).
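
A back-of-the-envelope sketch of N+1 capacity planning; the numbers in the example comment are illustrative only:

```go
// Sketch: plan capacity so that peak traffic still fits after losing an
// instance. With only 2 instances running near their limit, losing one pushes
// the survivor past 100% and the service collapses.
package capacity

import "math"

// instancesNeeded returns how many instances to deploy so that peak traffic,
// inflated by a safety buffer, still fits after `failures` instances are lost.
func instancesNeeded(peakQPS, perInstanceQPS float64, failures int, buffer float64) int {
	base := math.Ceil(peakQPS * (1 + buffer) / perInstanceQPS)
	return int(base) + failures
}

// Example: 1500 QPS peak, 1000 QPS per instance, tolerate 1 failure, 20%
// buffer -> ceil(1800/1000) + 1 = 3 instances, not 2.
```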

**Anti-pattern 3.3.2 Unreasonable capacity planning for new feature launches**

[Example] A service has many downstream dependencies. During capacity planning the focus was placed on one particular dependency, without a comprehensive evaluation of all of them; when one of the other dependencies ran into trouble, the whole service became unavailable.

[Solution] When new functionality comes online, do capacity planning for all downstream dependencies to avoid hidden bottlenecks.

3.4. Change management

**Anti-pattern 3.4.1 Code hitchhiking into a release**

[Example] Lacking an effective code-change management mechanism, one product line suffered several online faults caused by unrelated code hitchhiking into releases; and because each release carried many unrelated modifications, locating and tracing the problems was very difficult.

[Solution] Establish a strict code management mechanism and strictly forbid code from hitchhiking into a release, so that the trunk never carries code that has not been verified for release.

**Anti-pattern 3.4.2 Rolling back a service without rolling back the code**

[Example] An online service had a problem and was rolled back, but the code was not reverted right away. The next day another engineer deployed and carried the un-reverted problem code online again, causing failures two days in a row.

[Solution] When a service is rolled back, revert the corresponding code immediately as well.

**Anti-pattern 3.4.3 Excessive deployment concurrency**

[Example] The deployment concurrency is set so high that only a few machines remain in service at any moment, causing the cluster to avalanche.

[Solution] Configure the deployment concurrency so that the machines still in service can carry all of the business traffic.

**Anti-pattern 3.4.4 Service start-up or rollback takes too long**

[Example] A release goes wrong, and rolling back a single service takes so long that the loss cannot be stopped quickly.

[Solution] Check the service's start-up and rollback times regularly, so that a rollback can be completed immediately when a failure occurs.

**Anti-pattern 3.4.5 Configuration files lack an effective verification mechanism**

[Example] A configuration file is produced by a model and delivered to online services in real time through the data distribution system. The model generated a bad configuration file, which caused an online fault.

[Solution] Establish strict inspection and verification mechanisms for configuration files, especially those produced by models.

**Anti-pattern 3.4.6 Configuration changes without grayscale**

[Example] A configuration change was treated too casually from a stability standpoint and was rolled out without sufficient observation or grayscale, causing a failure.

[Solution] All changes, including service, configuration, data and environment changes, must go through strict observation and grayscale to guarantee their quality.

**Anti-pattern 3.4.7 Changes not rigorously tested**

[Example] A change looked too small to be worth testing; a low-level mistake slipped through and caused a failure.

[Solution] Every change must be tested and double-checked; even a single modified line of code can cause an online stability failure.

**Anti-pattern 3.4.8 Changes made without strictly following the change specification**

[Example] During a release, the small traffic cluster and data center A were checked strictly according to the specification, and the service and its curves looked fine; data center B was not checked when it was released. Data center B then failed, and the investigation found that a configuration was missing there.

[Solution] Every change must be checked strictly according to the change specification, and the service's curves and metrics must be checked at every stage of the release.

**Anti-pattern 3.4.9 Updating DB data offline directly through SQL**

[Example] The database was updated offline directly through SQL without any rate limiting, putting heavy pressure on the DB and causing a large number of timeouts for online service access.

[Solution] Except in special cases, operating on DB data directly through SQL is strictly prohibited. Modifications should go through service interfaces, which makes them observable on the service's curves and reduces the risk of writing to the database directly.

Batch modifications of DB data must be reported to the DBAs and carried out only after review and confirmation.

Batch changes and bulk inserts must be rate limited (a backfill sketch follows).
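
A sketch of such a rate-limited backfill, assuming golang.org/x/time/rate is available and that `updateViaAPI` is a hypothetical wrapper around the service's own write interface:

```go
// Sketch: run a batch backfill through the service's write interface with an
// explicit rate limit, instead of issuing raw SQL against the database.
package backfill

import (
	"context"

	"golang.org/x/time/rate"
)

func backfillRows(ctx context.Context, ids []int64,
	updateViaAPI func(ctx context.Context, id int64) error) error {
	// Cap the backfill at 50 writes per second so it cannot starve online traffic.
	limiter := rate.NewLimiter(rate.Limit(50), 1)
	for _, id := range ids {
		if err := limiter.Wait(ctx); err != nil {
			return err // context cancelled or deadline exceeded
		}
		if err := updateViaAPI(ctx, id); err != nil {
			return err
		}
	}
	return nil
}
```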

3.5. Monitoring and alerting

**Anti-pattern 3.5.1 Missing basic monitoring**

[Example] Basic monitoring is missing, so when a failure occurs it is not noticed immediately.

[Solution] Maintain a basic monitoring checklist, and review and drill the service's basic monitoring regularly.

**Anti-pattern 3.5.2 Missing business monitoring**

[Example] Business monitoring is missing, so when a failure occurs it is not noticed immediately.

[Solution] Add business monitoring for both core flows and core business metrics.

**Anti-pattern 3.5.3 Badly set alert thresholds**

[Example] A business bug produced a flood of online alerts, so the alert threshold was temporarily raised from 20 to 200. After the fix went out the threshold was never restored, so later business problems were not reported promptly.

[Solution] Prefer temporarily muting an alert over raising its threshold.

**Anti-pattern 3.5.4 Monitoring and alerts that are no longer valid**

[Example] The business iterates so quickly that the monitoring and alert configuration no longer matches the business.

[Solution] Drill alerts regularly to keep them effective.

For major business iterations, review monitoring and alerts against a checklist.

3.6. Plan management

**Anti-pattern 3.6.1 No anti-avalanche plan for abnormal upstream traffic**

[Example] A sudden surge in upstream traffic instantly overwhelms the service and the system avalanches.

[Solution] Services must prepare an anti-avalanche plan in advance; otherwise a surge can easily turn into a system-wide failure.

**Anti-pattern 3.6.2 No anti-scraping or anti-attack plan**

[Example] While investigating an online problem, heavy interface scraping was found against an online service, a serious hidden risk to the stability of the system and a large waste of resources and cost.

[Solution] Online services, especially those that interact heavily with client terminals, need anti-scraping and anti-attack strategies in their design, with plans prepared in advance.

**Anti-pattern 3.6.3 No plan for handling downstream failures**

[Example] A downstream service fails and the upstream has no corresponding handling plan, so the failure spreads; this situation has caused many large outages.

[Solution] Prepare handling plans for downstream failures, especially failures of weakly dependent downstream services.

**Anti-pattern 3.6.4 Plans that no longer work**

[Example] Because the business iterates quickly, a dependency that used to be weak has quietly become strong. When the downstream failed, the degradation plan was executed but the business did not recover.

[Solution] Drill plans regularly to make sure they stay effective.

3.7. Stability principles and awareness

**Anti-pattern 3.7.1 Lack of respect for stability**

[Example] Service failures are treated as normal, and stability is given no real thought.

[Solution] Engineers must hold code and online stability in awe and never count on luck: a bug in a single line of code can paralyze the whole business.

**Anti-pattern 3.7.2 Not stopping the loss first when a failure occurs**

[Example] When the service fails, the engineers involved spend the first moments locating the root cause instead of stopping the loss.

[Solution] Failures must be handled with first priority, and stopping the loss always comes first.

**Anti-pattern 3.7.3 Using insufficiently proven technology and solutions**

[Example] A service uses MQ's broadcast feature, which had never been used online anywhere in the company. When the service went live it triggered a bug in the MQ broadcast-consumption code, making the MQ cluster unavailable.

[Solution] Avoid technologies and solutions that have not been fully proven. If one must be used, have corresponding fallback measures in place and control the rollout pace carefully.

**Anti-pattern 3.7.4 Not controlling the rollout pace of new technology**

[Example] A service adopts MQ's broadcast feature and, before it has been sufficiently verified on non-core services, brings it into a core service. The core service's traffic is much larger, which triggers a bug in the MQ broadcast-consumption code and makes the MQ cluster unavailable.

[Solution] When introducing new technology, control the rollout pace and verify it thoroughly on non-critical services before applying it to core links.

**Anti-pattern 3.7.5 Stability improvements not implemented in time**

[Example] After a service failure, improvement measures were drawn up in the post-mortem but not implemented in a timely, effective way; the same problem later flared up again and caused another failure.

[Solution] Establish an effective tracking mechanism for improvement items to make sure they are actually implemented.


Team introduction

The Hitch service team is a group of united, optimistic and upright engineers who pursue excellence. We are committed to building first-class technology systems for safety, transactions and marketing services, and to helping Didi Hitch realize its mission of "sharing travel makes life better".

If you want to learn more about Didi Hitch’s premium technology sharing, please follow the official account “Didi Hitch Technology” to read the original article and more technical tips.


Welcome to follow Didi technical official account!
