Define the target

Since our goal is high availability, it is important to clarify what high availability means and make it quantifiable by breaking the goal down. As I see it, it decomposes into the following three sub-goals:

1. Maintain high service stability

System stability is the fundamental purpose of high availability. Generally speaking, the system should remain continuously available, should not fail without warning, and should keep working normally under heavy load.

2. Fast fault location

From an engineering standpoint, fault-free services do not exist, so faults must be discovered and located quickly. Ideally, an alerting mechanism pinpoints the cause of a fault before external users even notice it, helping engineers resolve the problem as soon as possible and limiting further impact on the service.

3. Fast service recovery

A few more words are needed on the difference between “restoring the business” and “fixing the problem,” which represent two different ways of responding to an online failure. Simply put, “restoring the business” means the root cause of the failure can be set aside while we apply a quick, temporary fix that gets the business running again. Many engineers fall into a habitual pattern when handling production failures: first find the cause, then change the code to fix it, test, release, and only then does the business function normally again. In practice, walking through that whole process is very costly in time, and the service suffers from the fault the entire while. For example, suppose the service on one machine responds slowly and requests time out. Possible causes include network bandwidth problems, a failing disk, CPU or memory exhaustion, an application-level infinite loop, long JVM garbage-collection pauses, and so on. It is hard to sift through so many candidates in just a few minutes, but we can restore the business without knowing the true cause: the simplest move is to take the machine offline immediately and redistribute its traffic to other machines or new machines.
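The “restore first, diagnose later” idea above can be sketched minimally: pull the unhealthy node out of rotation right away and let root-cause analysis happen offline. All names here are illustrative, not a real load-balancer API.

```python
# Hypothetical sketch: take a misbehaving machine out of traffic rotation
# immediately; the actual diagnosis (disk? GC? bandwidth?) can wait.
class ServicePool:
    def __init__(self, nodes):
        self.active = set(nodes)

    def mark_unhealthy(self, node):
        """Remove the node from rotation; traffic shifts to the rest."""
        self.active.discard(node)

    def route_targets(self):
        # Only healthy nodes receive requests.
        return sorted(self.active)

pool = ServicePool(["app-1", "app-2", "app-3"])
pool.mark_unhealthy("app-2")   # app-2's responses are timing out
print(pool.route_targets())    # remaining healthy nodes
```

The business is restored the moment the node leaves the pool, independently of when the underlying fault is understood.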

With these three sub-goals defined, the rest of the architecture work is organized around them.

Service tiering + service degradation

Service tiering: based on business requirements, services are divided into core services and common services, which must not affect each other; the resources behind them — servers, caches, databases, and MQ — are likewise separated. Service tiering addresses our sub-goal 1.

There are two key points:

1. Identify the core services. In our internet-finance business, for example, users, orders, and payments are core services, while message push, marketing coupons, and points are common services. Core services are what customers must use: once a core service has a problem, customers cannot buy products. Common services, even when broken, do not immediately affect transactions. At the same time, common-service code changes frequently in daily work. Keeping the core services usable is therefore our primary goal.

2. Separate the resources of the different tiers, including servers, caches, MQ, DB, and so on, because whenever tiers share resources, common services can affect core services. Take the simplest example: if both tiers share one Redis instance, a flood of message requests that exhausts the Redis connection pool will degrade the quality of the core services.
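Resource separation is essentially the bulkhead pattern. A minimal sketch, using thread pools as a stand-in for any shared resource (connection pools work the same way); the pool sizes and tier names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Bulkhead sketch: core and common services each get their own worker pool,
# so a burst of common-service work cannot starve core-service calls.
core_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="core")
common_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="common")

def submit(tier, fn, *args):
    """Route work to the pool that belongs to its tier."""
    pool = core_pool if tier == "core" else common_pool
    return pool.submit(fn, *args)

f = submit("core", lambda: "order-ok")
print(f.result())
```

The same principle applies to Redis connection pools, DB connection pools, and MQ consumers: one set per tier, sized independently.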

Service degradation: when a fault occurs, common services can be degraded outright to protect the core services. Service degradation addresses our sub-goal 3.

Even after splitting services into core and common tiers, many business scenarios still involve calls between them, so they can affect each other. For example, before we push messages (a common service), we must query user information (a core service); a large volume of messages therefore puts heavy pressure on the core business system. In such cases we can shut off the non-core service to ensure the core services are unaffected.
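The simplest form of such a shut-off is a degradation switch checked before every non-core call. A hedged sketch (flag names and the push function are invented for illustration):

```python
# Degradation switch sketch: when flipped, the common service short-circuits
# instead of calling into the core user service.
DEGRADED = {"message_push": False}

def push_message(user_id):
    if DEGRADED["message_push"]:
        return None                    # degraded: drop silently, core path spared
    # normal path would query user info (core service) and push
    return f"pushed to {user_id}"

DEGRADED["message_push"] = True        # fault detected: flip the switch
```

While the switch is on, push traffic never reaches the core user service; flipping it back restores the feature with no deploy.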

In addition, the best way to trigger degradation is through dynamic configuration, rather than logging onto machines to edit static configuration by hand or releasing a new version, which is error-prone and slow. I therefore recommend a service like Alibaba's Diamond: once a service is connected, the Diamond backend pushes the latest configuration (such as a Groovy script) directly to it, and the degradation takes effect without restarting the service.
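The push model behind such a config center can be sketched as a listener pattern. This is not Diamond's real API — all class and method names below are invented to show the idea:

```python
# Hypothetical config-center sketch: the center pushes new values to
# registered listeners, so a degradation flag changes without a redeploy.
class ConfigCenter:
    def __init__(self):
        self._listeners = {}

    def listen(self, key, callback):
        """Service registers interest in a config key at startup."""
        self._listeners.setdefault(key, []).append(callback)

    def publish(self, key, value):
        """Operator publishes a new value; it is pushed to every listener."""
        for cb in self._listeners.get(key, []):
            cb(value)   # applied in-process, no restart needed

degrade_flags = {}
center = ConfigCenter()
center.listen("degrade.push", lambda v: degrade_flags.update(push=v))
center.publish("degrade.push", True)   # degradation takes effect immediately
```

The key property is that the service only ever reads its local `degrade_flags`; the config center owns propagation.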

Establishing Layered Monitoring

The purpose of layered monitoring is to collect all the information needed for fault analysis and localization; it addresses our sub-goal 2. Monitoring is divided into five layers, with the following meaning at each layer:

The network layer

Analyze network access. For example, if a flood of external requests saturates the bandwidth of the external-facing network card, we can immediately check whether the traffic is legitimate.

The interface layer

Collect the access status of exposed interfaces, including execution time, returned status codes, call counts, and so on. The most frequently accessed API interfaces deserve the most attention. From these statistics we can decide whether the service needs to scale out, and detect abnormal external access such as scripted or bot traffic. If some interfaces return a large number of error codes, we need to find the cause of the failures immediately.
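A minimal sketch of what interface-layer collection records per endpoint — call count, status-code tally, cumulative latency (all names and endpoints are illustrative):

```python
from collections import defaultdict

# Interface-layer metrics sketch: one record() call per completed request.
calls = defaultdict(int)          # endpoint -> total calls
status_codes = defaultdict(int)   # (endpoint, status) -> count
latency_ms = defaultdict(float)   # endpoint -> cumulative latency

def record(endpoint, status, elapsed_ms):
    calls[endpoint] += 1
    status_codes[(endpoint, status)] += 1
    latency_ms[endpoint] += elapsed_ms

record("/api/order", 200, 12.5)
record("/api/order", 500, 40.0)   # an error worth alerting on
```

From these three maps one can derive error rate and average latency per endpoint, which is exactly what capacity and anomaly decisions need.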

The business layer

Collect and analyze how core and common services run and call each other — for example, when a service throws a large number of exceptions or Dubbo service invocations time out.
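A toy version of a business-layer check: alert when failures within a sliding window exceed a threshold. Window size and threshold are arbitrary illustration values:

```python
from collections import deque

# Business-layer sketch: track the last `window` call outcomes and alert
# when failures (exceptions, timeouts) exceed `threshold`.
class ExceptionMonitor:
    def __init__(self, window=100, threshold=10):
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok):
        self.events.append(ok)

    def should_alert(self):
        failures = sum(1 for ok in self.events if not ok)
        return failures > self.threshold

mon = ExceptionMonitor(window=20, threshold=3)
for _ in range(5):
    mon.record(False)   # e.g. five Dubbo invocations timed out in a row
```

Real systems would feed this from exception logs or RPC middleware hooks rather than explicit `record()` calls.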

The middleware layer

The middleware layer covers the various middleware the services depend on, such as containers, caches, and message queues. Different middleware call for different metrics; for Redis, for example, the monitored indicators include connection count, request count, RDB and AOF execution, I/O frequency, cache hit ratio, and so on.
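Several of those Redis indicators come straight out of the `INFO` command. A sketch that parses a sample payload (the sample text is fabricated; the field names `connected_clients`, `keyspace_hits`, `keyspace_misses` are real `INFO` fields) and derives the hit ratio:

```python
# Sketch: turn a Redis INFO-style payload into monitoring metrics.
SAMPLE_INFO = """connected_clients:42
total_commands_processed:100000
keyspace_hits:900
keyspace_misses:100"""

def parse_info(text):
    metrics = dict(line.split(":", 1) for line in text.splitlines())
    hits = int(metrics["keyspace_hits"])
    misses = int(metrics["keyspace_misses"])
    # Cache hit ratio: the headline health number for a cache tier.
    metrics["hit_ratio"] = hits / (hits + misses)
    return metrics

m = parse_info(SAMPLE_INFO)
```

In production the payload would come from a live `INFO` call on each Redis instance, collected on a fixed interval.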

System layer

The system layer covers operating-system status and metrics, including CPU usage, memory usage, network-interface traffic, and connection counts.
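A few system-layer numbers are reachable from the standard library alone; a hedged sketch (real deployments would use a dedicated agent such as a metrics exporter, and `os.getloadavg` is Unix-only):

```python
import os
import shutil

# System-layer sketch: load average and disk usage from the stdlib.
def system_snapshot(path="/"):
    load1, load5, load15 = os.getloadavg()        # Unix-only call
    disk = shutil.disk_usage(path)
    return {
        "load_1m": load1,
        "disk_used_pct": round(disk.used / disk.total * 100, 1),
    }

snap = system_snapshot()
print(snap)
```

CPU percentage, per-NIC traffic, and connection counts need OS-specific sources (`/proc` on Linux) or an agent, which is why this layer is usually delegated to standard collectors.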

Conclusion

The high-availability goal was split into three sub-goals, and an optimization idea was proposed for each: service tiering + service degradation + layered monitoring. Working along these three directions, layer by layer, ultimately yields a qualitative improvement in the system's technical architecture.