
The server-side team is the largest engineering team in the Hitch technology department, and as the team grows and the business iterates, process specifications play an increasingly important role. On the one hand, standardization improves our delivery quality and efficiency; on the other hand, we want to keep distilling lessons from real-world practice and explore the best practices that suit our team.

Based on this, we developed and promoted a set of executable, minimal engineering specifications for server-side development, covering the development process, stability, performance and cost, and more.

This article presents the stability-related specifications. They are distilled from the postmortems of a large number of real online faults in the five years since Hitch was established, in the hope of helping improve the stability of your services.

1. Terminology

A number of technical terms appear in the descriptions below; for easier understanding, here are the main ones:

  • Service classification: based on business needs, we classify the minimal core system as level-1 services, which must be followed up with top priority whenever a problem occurs. Services that affect core business metrics (such as order issuing and order taking) are defined as level-1 services; all other services are level-2.

  • Preview cluster: an environment deployed exactly like the online production environment, but carrying no live traffic. It can be accessed internally, and traffic is closed-loop within the cluster.

  • Small-traffic cluster: an environment deployed exactly like the online production environment. Through traffic control, only the traffic of a few cities falls into this cluster, and traffic is closed-loop within it.

  • Grayscale release: the release process is staged through the preview cluster, the grayscale (small-traffic) city cluster, and then 10%, 50%, and 100% of traffic, to ensure a safe rollout.

  • Full-link stress test: a solution for stress-testing the production environment without affecting online services, used to determine the capacity and bottlenecks of production.

  • Multi-equipment-room deployment: when one equipment room fails, traffic can be quickly switched to other equipment rooms to reduce loss. The full solution involves traffic routing, traffic closed-loop, data synchronization, data consistency, disaster response, and many other aspects.

2. Stability specification

Stability design

  • [Mandatory] The caller must set a timeout, and timeouts along the call chain must decrease from top to bottom (see the sketch after this list).

  • [Mandatory] New dependencies in the core process are weak dependencies by default; adding a strong dependency requires review and approval.
  • [Mandatory] If a downstream service is registered with service discovery, it must be accessed through service discovery, so that downstream nodes and timeouts can be managed there.
  • [Mandatory] All internal services must be connected to service discovery, and external services should be connected to service discovery whenever possible.
  • [Suggestion] The framework should support manual one-click circuit breaking of dependent services.
  • [Suggestion] Prefer stateless service designs.
  • [Suggestion] Write interfaces should guard against duplicate (re-entrant) requests.
  • [Suggestion] Keep system designs simple and reliable, and prefer mature technologies.
  • [Suggestion] Set reasonable rate-limit configurations for interfaces (mandatory for core services, recommended for others).
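As a concrete illustration of the decreasing-timeout rule above, here is a minimal Go sketch. The layer names and the 500ms/300ms/100ms budgets are illustrative assumptions, not values from the specification; the point is only that each layer derives its deadline from its caller's context and never waits longer than the remaining upstream budget.

```go
package chain

import (
	"context"
	"time"
)

// EntryHandler owns the largest budget; each layer below derives its deadline
// from the caller's context and uses a smaller budget of its own.
func EntryHandler(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond) // entry layer
	defer cancel()
	return callLogicLayer(ctx)
}

func callLogicLayer(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond) // middle layer
	defer cancel()
	return callStorageLayer(ctx)
}

func callStorageLayer(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond) // bottom layer
	defer cancel()
	// Placeholder for the real storage RPC; it must honor ctx's deadline.
	select {
	case <-time.After(50 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```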

Deployment and O&M

  1. [Mandatory] It is strictly forbidden to manipulate online data with ad-hoc scripts that bypass interfaces or encapsulated methods; if such a script is truly necessary, it must pass QA testing.
  2. [Mandatory] Services must be released through the release platform and integrated with the quality platform (automated test cases, core metric dashboards, and other go-live checklist items), with a mandatory observation period.
  3. [Mandatory] Level-1 services must have a preview cluster and a small-traffic cluster (except for some special services), and must be deployed in two equipment rooms.
  4. [Suggestion] Non-level-1 online services are recommended to have a preview cluster.
  5. [Suggestion] Plan capacity when a new service goes online; it is recommended to verify module capacity through interface stress tests or full-link stress tests.

Monitoring and alerting

  1. [Mandatory] Online machines must have basic monitoring and alerts, covering CPU, I/O, memory, disk, coredump, and port checks.
  2. [Mandatory] Online services must have basic service monitoring, including interface QPS, fatal error count, and latency.
  3. [Suggestion] Core business metrics (order issuing, order taking, payment, etc.) must have monitoring and alerts.
  4. [Suggestion] There should be an overall service dashboard covering the monitoring of the core modules in this area, so that service problems can be located quickly.

Change management

  1. [Mandatory] Any level-1 service change must go through the grayscale release mechanism.
  2. [Mandatory] Any level-1 service change, whether a service change or a configuration change, must have a corresponding rollback plan so that an abnormal change can be rolled back quickly.
  3. [Suggestion] Avoid piggybacking unrelated code onto a release.
  4. [Suggestion] When a service is rolled back, also roll back the corresponding code and configuration to keep the mainline correct.
  5. [Suggestion] Add a configuration verification mechanism for configuration changes, especially complex ones.

Contingency plan management

  1. [Mandatory] There must be a multi-active traffic switchover plan whose effectiveness is guaranteed; drills must be organized regularly, monthly is recommended.
  2. [Mandatory] The full-link stress-test channel must remain effective, and stress tests should be organized regularly.
  3. [Mandatory] The one-click rate-limiting plan must remain effective; review and drill it regularly.
  4. [Mandatory] The degradation plan for strong dependencies must remain effective and be drilled regularly.

Fault handling principles

  1. [Mandatory] When an online fault occurs, it must be handled as the top priority.
  2. [Mandatory] When an online fault occurs and a recent change is involved, roll the change back immediately.
  3. [Mandatory] Every online fault must be followed by a postmortem review.
  4. [Mandatory] There must be a postmortem specification, and reviews must be carried out in accordance with it.

3. Stability anti-patterns

This chapter is based on a large number of online fault cases. Driven by concrete examples, it distills the stability problems that commonly occur at each stage of the R&D process into anti-patterns, for your reference, so that similar problems can be avoided in future work and the stability of online services improved.

3.1. Disaster recovery and fault tolerance design

Anti-pattern 3.1.1 Overly aggressive node circuit-breaking policy

[Example] To improve the request success rate, the caller circuit-breaks a downstream node when it misbehaves, for example, if access to the node errors 5 times in a row within 1 minute, the node is broken and no longer called (recovering after a period of time). One day, network jitter caused 3 of the 4 instances of a downstream service to enter the broken state, so all traffic to that downstream fell on the single remaining instance, which was overwhelmed. The downstream service avalanched and the entire system became unavailable.

[Solution] When using circuit breaking, protective limits on the breaker itself are also required, to avoid the stability problems caused by breaking too many nodes.
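A minimal Go sketch of one such protective limit, assuming a hypothetical Pool type and an illustrative 50% ejection cap: no matter how many nodes report errors, the breaker refuses to eject more than the configured fraction, so the surviving nodes are never left to carry all of the traffic.

```go
package breaker

import "sync"

// Pool tracks downstream nodes and refuses to eject more than a fixed
// fraction of them, even if many report errors at once.
type Pool struct {
	mu         sync.Mutex
	nodes      []string
	ejected    map[string]bool
	maxEjected float64 // e.g. 0.5: never break more than 50% of the nodes
}

func NewPool(nodes []string, maxEjected float64) *Pool {
	return &Pool{nodes: nodes, ejected: make(map[string]bool), maxEjected: maxEjected}
}

// ReportFailure is called when a node trips its error threshold
// (e.g. 5 consecutive errors within 1 minute).
func (p *Pool) ReportFailure(node string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.ejected[node] {
		return
	}
	// Protection: beyond the cap, keep routing to "bad" nodes rather than
	// concentrating all traffic on the few that remain.
	if float64(len(p.ejected)+1) > p.maxEjected*float64(len(p.nodes)) {
		return
	}
	p.ejected[node] = true
}

// Healthy lists the nodes still eligible to receive traffic.
func (p *Pool) Healthy() []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	var out []string
	for _, n := range p.nodes {
		if !p.ejected[n] {
			out = append(out, n)
		}
	}
	return out
}
```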

Anti-pattern 3.1.2 Fixed retry sequences

[Example] The retry sequence is fixed: a query that fails on one instance always retries on the same "next" instance.

[Consequence] The first risk is an avalanche: suppose the retry sequence for a class of queries is A, B. When A fails, B carries twice the load; if B is then crushed, the instance after B is crushed in turn. The second risk is correlated failure: suppose the retry count is 2 and A and B restart at the same time; every query whose retry sequence is A, B is guaranteed to return no result.

[Solution] Evaluate other retry algorithms, such as random retry. Note, however, that compared with fixed retry sequences, random retry sequences bring their own risks; for example, they may reduce the cache hit ratio of downstream modules and hurt performance.
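A minimal Go sketch of random retry-target selection, assuming a hypothetical PickTargets helper; the caller tries the returned instances in order until one succeeds or the retry budget is exhausted.

```go
package retry

import "math/rand"

// PickTargets returns up to maxTries distinct instances in random order,
// instead of a fixed "next instance" sequence. Note that randomizing targets
// can lower downstream cache hit ratios, as mentioned above.
func PickTargets(instances []string, maxTries int) []string {
	shuffled := append([]string(nil), instances...)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	if maxTries > len(shuffled) {
		maxTries = len(shuffled)
	}
	return shuffled[:maxTries]
}
```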

Anti-pattern 3.1.3 Improper timeout settings

[Example] The upstream service's timeout for a downstream call is set improperly (far too long), so when the downstream has problems the upstream is dragged down with it.

[Solution] Set timeouts based on the 99th percentile (p99) of the call's latency, and review the timeout configuration of the whole link periodically.
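A minimal Go sketch of deriving a timeout from observed latencies; the P99 and RecommendedTimeout helpers and the 1.5x headroom factor are illustrative assumptions, not values from the specification.

```go
package timeouts

import (
	"sort"
	"time"
)

// P99 returns the 99th-percentile value of the recorded latency samples.
func P99(samples []time.Duration) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted))*0.99) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

// RecommendedTimeout adds headroom above p99 so that normal slow requests
// still succeed while genuinely stuck calls are cut off.
func RecommendedTimeout(samples []time.Duration) time.Duration {
	return time.Duration(float64(P99(samples)) * 1.5)
}
```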

Anti-pattern 3.1.4 Multiple downstream calls within one request are not considered

[Example] The timeout for a single downstream call is set reasonably, but the same downstream service is called several times serially within one request. When the downstream fails, the upstream service is still dragged down.

[Solution] Besides the timeout of a single downstream call, also budget the total time the request may spend on that downstream.
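A minimal Go sketch of a per-request time budget shared by serial downstream calls, using the standard context package; the 300ms budget, the three-call loop, and the callDownstream placeholder are illustrative assumptions.

```go
package handler

import (
	"context"
	"time"
)

// callDownstream stands in for a real RPC client; it must honor ctx's deadline.
func callDownstream(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// HandleRequest gives the whole request one 300ms budget; the three serial
// downstream calls all draw from whatever remains of it.
func HandleRequest(parent context.Context) error {
	ctx, cancel := context.WithTimeout(parent, 300*time.Millisecond)
	defer cancel()

	for i := 0; i < 3; i++ {
		if err := callDownstream(ctx); err != nil {
			return err // budget exhausted or the call itself failed
		}
	}
	return nil
}
```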

Anti-pattern 3.1.5 Unreasonable retry logic

[Example] There are retries in multiple places along the link. When downstream services fail, retries are amplified, resulting in service avalanches.

[Solution] Evaluate the retry mechanism, map out the entire request-handling chain, and ensure that retries converge at a single layer.

Anti-pattern 3.1.6 The impact of business traffic spikes is not considered

[Example] A certain business scenario produces request spikes on the hour and on the half hour. During a holiday, the traffic to a downstream service surged at those moments and the downstream avalanched.

[Solution] Smooth out the business spikes to reduce the peak load placed on downstream services.
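One common way to smooth such spikes is to add random jitter before work that would otherwise fire at exactly the same moment; a minimal Go sketch, with SleepWithJitter as a hypothetical helper.

```go
package jitter

import (
	"math/rand"
	"time"
)

// SleepWithJitter waits a random amount of time, up to maxJitter, before
// scheduled work runs, so a crowd of clients does not hit the downstream at
// exactly the same instant (e.g. on the hour or half hour).
func SleepWithJitter(maxJitter time.Duration) {
	if maxJitter <= 0 {
		return
	}
	time.Sleep(time.Duration(rand.Int63n(int64(maxJitter))))
}
```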

Anti-pattern 3.1.7 No fault tolerance for abnormal input

[Example] The service does not tolerate abnormal input and processes it with the normal logic anyway, corrupting data.

[Solution] Services, and service entry points in particular, must tolerate unreasonable or abnormal input to keep the whole system robust.

Anti-pattern 3.1.8 Interfaces are not idempotent

[Example] An interface is not idempotent. When the network failed, a large number of retries were triggered, corrupting a large amount of core data.

[Solution] Interface design must take idempotency requirements into account.
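A minimal Go sketch of an idempotent write interface keyed by a client-supplied request ID; the Service type, Charge method, and in-memory map are hypothetical (a real system would keep the deduplication record in a shared store).

```go
package payment

import (
	"errors"
	"sync"
)

var ErrDuplicate = errors.New("request already processed")

// Service deduplicates writes by a client-supplied request ID. In production
// the processed set would live in a shared store (DB or cache), not memory.
type Service struct {
	mu        sync.Mutex
	processed map[string]bool
}

func NewService() *Service {
	return &Service{processed: make(map[string]bool)}
}

// Charge applies the deduction at most once per requestID, no matter how many
// times the caller retries after a timeout or network error.
func (s *Service) Charge(requestID, userID string, amountCents int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.processed[requestID] {
		return ErrDuplicate // callers can safely treat this as success
	}
	// ... perform the actual deduction for userID / amountCents here ...
	s.processed[requestID] = true
	return nil
}
```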

Anti-pattern 3.1.9 Non-core processes are not made weak dependencies

[Example] No dependency in the process is made weak, so the system as a whole is fragile: the failure of any single dependency brings down the entire business.

[Solution] Review the system's processes regularly, keep the core system minimal, and turn non-essential steps into weak dependencies wherever possible.

Anti-pattern 3.1.10 The impact of ID overflow is not considered

[Example] An ID was stored as int32; when it grew past the int32 range it overflowed and the service exposing it failed.

[Solution] When adding resource-related ID fields, consider the ID range and check for overflow risk. Periodically review resource-related ID fields to prevent such faults as far as possible.
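A minimal Go sketch of guarding a legacy 32-bit ID field against overflow; ToLegacyInt32 is a hypothetical helper, and in practice new ID fields would simply be declared int64.

```go
package ids

import (
	"errors"
	"math"
)

var ErrOverflow = errors.New("id exceeds int32 range")

// ToLegacyInt32 converts a 64-bit ID for a legacy int32 field and rejects
// values that would otherwise wrap around silently.
func ToLegacyInt32(id int64) (int32, error) {
	if id > math.MaxInt32 || id < math.MinInt32 {
		return 0, ErrOverflow
	}
	return int32(id), nil
}
```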

3.2. Deployment and O&M

Anti-pattern 3.2.1 Network segments are not considered during deployment

[Example] Multiple instances of a service are deployed under the same switch without considering network segments. When the switch fails, all of those instances become unavailable and the cluster avalanches.

[Solution] When deploying services, take physical location into account and spread the instances of a service across different equipment rooms, switches, and network segments as much as possible.

Anti-pattern 3.2.2 Insufficient resource isolation when services are co-located

[Example] Multiple services are co-located on the same machines; one of them uses too much CPU and the other services become abnormal.

[Solution] When co-locating services, isolate their resources so that one service hogging resources cannot make the other services on the same machine unavailable.

Anti-pattern 3.2.3 The core business is not isolated and protected

[Example] Because of a bug, a non-core process wrote a huge volume of messages to MQ, making the whole MQ cluster unavailable and bringing down the entire business.

[Solution] Deploy the MQ for core and non-core links separately, so that changes in non-core processes cannot affect the main process, protecting the stability of core processes and services.

3.3. Capacity management

Anti-pattern 3.3.1 Failure scenarios are not considered in capacity planning

[Example] An online service with modest QPS was deployed with only two instances. When one instance had a problem during peak hours, all traffic fell on the other instance and crushed the service.

[Solution] When estimating capacity, factor in disaster recovery and reserve a buffer. If deploying more instances seems like a waste of machines, use an elastic cloud, which is more flexible.

Anti-pattern 3.3.2 Capacity is not properly planned for newly launched features

[Example] A service has many downstream dependencies. Capacity planning focused on one particular dependency instead of evaluating all of them; when one of the other dependencies failed, the whole service became unavailable.

[Solution] When launching a new feature, do capacity planning for all downstream dependencies to prevent any of them from becoming a bottleneck.

3.4. Change management

Anti-pattern 3.4.1 Piggybacking code onto a release

[Example] Because there was no effective code-change management mechanism, a product line suffered several online failures caused by unrelated code piggybacking on releases; and because each release bundled many modifications, locating and tracing the problems was very difficult.

[Solution] Establish a strict code management mechanism and forbid piggyback releases, so that the trunk never contains code that has not been verified online.

Anti-pattern 3.4.2 Code is not rolled back when the service is rolled back

[Example] After a service was rolled back, the code was not rolled back immediately. The next day another engineer's release brought the unreverted problem code back online, causing failures on two consecutive days.

[Solution] When a service is rolled back, roll the code back immediately as well.

Anti-pattern 3.4.3 Deployment concurrency set too high

[Example] The deployment concurrency was set so high that only a few machines were serving traffic at any moment, causing a cluster avalanche.

[Solution] Deployment concurrency must be set so that the machines still in service can carry all of the traffic.

Anti-pattern 3.4.4 Service start-up or rollback takes too long

[Example] When an online problem occurred, rolling back a single service took so long that the loss could not be stopped quickly.

[Solution] Periodically check the start-up and rollback times of services to ensure that a faulty release can be rolled back quickly.

Anti-pattern 3.4.5 Configuration files lack an effective verification mechanism

[Example] A configuration file is generated by a model and pushed to online services in real time by the data distribution system. A faulty model-generated configuration file caused an online failure.

[Solution] Establish a strict checking and verification mechanism for configuration files, especially those generated automatically by models.

Anti-pattern 3.4.6 Configuration changes are not released with grayscale

[Example] A configuration change was treated too casually: it was rolled out without sufficient observation or grayscale, and caused a failure.

[Solution] All changes, including service, configuration, data, and environment changes, must be rolled out gradually and observed carefully to ensure change quality.

Anti-pattern 3.4.7 Changes are not rigorously tested

[Example] A change was considered too small to need testing; a trivial mistake slipped through and caused a failure.

[Solution] Every change must be tested and double-checked; even a one-line code change can cause an online stability failure.

Anti-pattern 3.4.8 Changes do not strictly follow the change specification

[Example] During a release, the small-traffic cluster and equipment room A were checked strictly according to the specification, and the service and its metric curves were fine. The release to equipment room B, however, was not checked. Equipment room B then failed; the cause was a missing configuration in equipment room B.

[Solution] Every change must be checked strictly according to the change specification, and all service curves and metrics must be checked at every stage of the release.

Anti-pattern 3.4.9 Updating DB data directly with offline SQL

[Example] The database was updated directly with offline SQL, without proper rate limiting, putting heavy pressure on the DB and causing a large number of timeouts for the online services accessing it.

[Solution] Except in special circumstances, do not manipulate DB data directly through SQL. Make the modification through an interface instead, which can be observed on metric curves and reduces the risk of touching the database directly.

Batch modifications of DB data must be reported to the DBA and may only proceed after review and confirmation.

Any batch insertion or update must have rate-limiting measures in place.
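A minimal Go sketch of such a rate limit on a batch update, using golang.org/x/time/rate; the table, column, batch size, and 10-batches-per-second limit are illustrative assumptions.

```go
package backfill

import (
	"context"
	"database/sql"

	"golang.org/x/time/rate"
)

// UpdateInBatches applies the update in small batches through a rate limiter,
// so the backfill cannot starve online traffic on the same database.
func UpdateInBatches(ctx context.Context, db *sql.DB, ids []int64) error {
	limiter := rate.NewLimiter(rate.Limit(10), 1) // at most 10 batches per second
	const batchSize = 100

	for start := 0; start < len(ids); start += batchSize {
		if err := limiter.Wait(ctx); err != nil {
			return err // context cancelled or deadline exceeded
		}
		end := start + batchSize
		if end > len(ids) {
			end = len(ids)
		}
		for _, id := range ids[start:end] {
			if _, err := db.ExecContext(ctx,
				"UPDATE orders SET status = 'migrated' WHERE id = ?", id); err != nil {
				return err
			}
		}
	}
	return nil
}
```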

3.5. Monitoring and alerting

Anti-pattern 3.5.1 Lack of basic monitoring

[Example] Basic monitoring was missing, so a failure was not detected promptly.

[Solution] Compile a basic-monitoring checklist, and review and drill the basic monitoring of services regularly.

Anti-pattern 3.5.2 Lack of service monitoring

[Example] Service-level monitoring was missing, so a failure was not detected promptly.

[Solution] Add service monitoring for core processes and core business indicators.

Anti-pattern 3.5.3 Alarm thresholds set improperly

[Example] A service bug generated a flood of online alarms, so the alarm threshold was temporarily raised from 20 to 200. After the fix was released, nobody remembered to change the threshold back.

[Solution] Prefer briefly muting an alarm over raising its threshold.

Anti-pattern 3.5.4 Monitoring and alarms have gone stale

[Example] The service iterates so quickly that the monitoring and alarms no longer match the service.

[Solution] Drill alarms periodically to ensure they remain effective.

For major business iterations, monitoring and alarms must be part of the go-live checklist.

3.6. Contingency plan management

Anti-pattern 3.6.1 No anti-avalanche plan for abnormal upstream traffic

[Example] Upstream traffic surged suddenly; the service was overwhelmed in an instant and the system avalanched.

[Solution] Every service must have an anti-avalanche plan prepared in advance; otherwise a traffic surge can easily take down the whole system.
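Rate limiting is one common anti-avalanche protection (the specification above already calls for reasonable rate-limit configurations). A minimal Go sketch using golang.org/x/time/rate; the 1000 req/s rate and burst of 200 are illustrative assumptions.

```go
package server

import (
	"net/http"

	"golang.org/x/time/rate"
)

// limiter allows a sustained 1000 requests per second with bursts up to 200.
var limiter = rate.NewLimiter(rate.Limit(1000), 200)

// protectedHandler sheds excess load instead of letting a surge take the
// whole service down.
func protectedHandler(w http.ResponseWriter, r *http.Request) {
	if !limiter.Allow() {
		http.Error(w, "too many requests", http.StatusTooManyRequests)
		return
	}
	w.Write([]byte("ok"))
}
```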

Anti-pattern 3.6.2 No anti-abuse or anti-attack plan

[Example] While locating an online problem, a large volume of abusive scripted requests was found hitting one of the service's interfaces, posing a serious stability risk and wasting substantial resources and cost.

[Solution] Online services, especially those that interact heavily with client devices, must have anti-abuse and anti-attack strategies and plans prepared in advance.

Anti-pattern 3.6.3 No contingency plan for downstream faults

[Example] A downstream service failed and the upstream had no corresponding plan; it was dragged down by the downstream, which led to a large-scale failure.

[Solution] Downstream faults, especially faults of weakly dependent downstream services, must have corresponding handling plans.

Anti-pattern 3.6.4 The plan turns out to be ineffective when a fault occurs

[Example] Because the service iterated rapidly, a dependency that used to be weak had quietly become strong. When the downstream failed, the degradation plan was executed but the fault did not recover.

[Solution] Drill plans regularly to ensure they remain effective.

3.7. Stability principles and awareness

Anti-pattern 3.7.1 Lack of respect for stability

[Example] Service failures are treated as normal, and stability is not taken seriously.

[Solution] Engineers must treat code and online stability with respect and never count on luck; a bug in a single line of code can bring down the entire business.

Anti-pattern 3.7.2 Stopping the loss is not treated as the first priority

[Example] When a service failed, the engineers involved spent the first moments locating the root cause instead of stopping the loss.

[Solution] Faults must be handled as the top priority, and stopping the loss always comes first.

Anti-pattern 3.7.3 Use of insufficiently validated technologies and solutions

[Example] A service used the broadcast feature of MQ, which had never been used online at the company. After launch, a bug in the MQ broadcast-consumption code was triggered and the MQ cluster failed.

[Solution] Avoid technologies and solutions that have not been fully validated. If one must be used, put fallback measures in place and control the adoption pace; only after sufficient validation on non-critical services should it be applied to the core link.

Anti-pattern 3.7.4 The adoption pace of a new technology is not controlled

[Example] A service used the broadcast feature of MQ. Before it had been validated long enough on non-core services, the feature was introduced into a core service. The core service's heavy traffic triggered a bug in the MQ broadcast-consumption code and the MQ cluster failed.

[Solution] When introducing a new technology, control the adoption pace and validate it fully on non-critical services before applying it to the core link.

Anti-pattern 3.7.5 Stability improvements are not implemented in time

[Example] After a service failure, improvement measures were drawn up during the postmortem but were not implemented promptly and effectively; the same problem then broke out again and caused another failure.

[Solution] Establish an effective tracking mechanism for improvement actions to ensure they are actually carried out.


Team introduction

The Hitch server team is a group of united, optimistic, honest partners who pursue excellence, committed to building first-class safety, transaction, and marketing service technology systems to help Didi Hitch fulfill its mission of “a better life through shared travel”.

To learn more about Didi Hitch technology, follow the “Didi Hitch Technology” official account to read the original article and more technical content.


Welcome to follow the Didi Technology official account!
