1. Background

Most Internet applications and large enterprise applications need to run 7×24 hours with as few interruptions as possible, yet achieving truly uninterrupted operation is extremely difficult. For this reason, availability is typically measured in “nines”, from three nines to five nines.

| Availability | Calculation | Unavailable time (minutes/year) |
| --- | --- | --- |
| 99.9% | 0.1% × 365 × 24 × 60 | 525.6 |
| 99.99% | 0.01% × 365 × 24 × 60 | 52.56 |
| 99.999% | 0.001% × 365 × 24 × 60 | 5.256 |
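As a quick sanity check, the unavailable-time column can be reproduced directly from the availability figure (a minimal sketch; the constants match the table above):

```java
public class DowntimeBudget {
    public static void main(String[] args) {
        double minutesPerYear = 365 * 24 * 60;            // 525,600 minutes per year
        double[] availability = {0.999, 0.9999, 0.99999}; // three, four and five nines
        for (double a : availability) {
            // allowed unavailable time per year, in minutes
            System.out.printf("%.3f%% -> %.2f minutes/year%n", a * 100, (1 - a) * minutesPerYear);
        }
    }
}
```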

It is not easy to keep an application highly available as its functionality and data volume keep growing. To achieve high availability, CreditEase's payment system has done a lot of exploration and practice in three areas: avoiding single points of failure, ensuring the high availability of the application itself, and coping with transaction volume growth.

Excluding sudden failures of external dependencies, such as network problems, third-party payment providers, or large-scale bank outages, CreditEase's payment system can deliver 99.999% service availability.

This article focuses on how to improve the availability of the application itself; how to avoid single points of failure and how to cope with transaction growth will be discussed in other parts of this series.

To improve the availability of an application, the first thing to do is to avoid failures as much as possible, but it is impossible to avoid them completely. The Internet is prone to the “butterfly effect”: any accident that seems to have almost zero probability can happen, and then be magnified without limit.

Everyone knows that RabbitMQ is very stable and reliable. CreditEase had been running a single-node RabbitMQ from the very beginning and had never seen a failure, so psychologically nobody expected it to go wrong.

Then one day, the physical host on which that RabbitMQ node ran broke down through neglect. The RabbitMQ node became unavailable, and with it the system's services.

A failure itself is not terrible; what matters most is finding and fixing it in time. The requirement CreditEase's payment system sets for itself is to detect faults within seconds and to diagnose and resolve them quickly, so as to minimize the damage they cause.

2. The problems

Learn from history. First, let's briefly review some problems CreditEase's payment system has run into:

(1) A new developer, through lack of experience, neglected to set a timeout when integrating a new third-party channel. A small detail like this blocked all transactions in that third party's queue and affected transactions on other channels (a timeout-setting sketch follows this list).

(2) CreditEase's payment system is deployed in a distributed fashion and supports grayscale releases, so there are many complex environments and deployment modules. Once, a new module was added; because there were multiple environments, each with two nodes, there were not enough database connections after the new module went live, which affected the functions of other modules.

(3) The same issue again, timeouts: a third-party timeout exhausted all of the configured worker threads, leaving no threads to process other transactions.

(4) Third party A provided both an authentication interface and a payment interface. A sudden surge in CreditEase's transaction volume on one of these interfaces triggered the DDoS protection that third party A's network operator had in place. Since the egress IP address of a data center is usually fixed, the operator mistook the transactions coming from that egress IP for a traffic attack, and as a result both of third party A's interfaces, authentication and payment, became unavailable at the same time.

(5) Another database problem, also caused by a sudden increase in transaction volume. When the volume was small, the values generated by the system fit within the field's 32-bit width and the sequence was never pushed beyond it; as transaction volume grew, however, the sequence climbed unnoticed until 32 bits were no longer enough.
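Problems (1) and (3) both come down to missing or unbounded timeouts. A minimal sketch of the fix, using the JDK's HttpURLConnection; the endpoint and the timeout values are illustrative assumptions, not CreditEase's actual configuration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ThirdPartyClient {
    /** Calls a third-party channel with explicit timeouts so a slow channel cannot block worker threads. */
    public static byte[] callThirdParty(String endpoint) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(3_000);  // give up quickly if the channel cannot be reached
        conn.setReadTimeout(10_000);    // never wait indefinitely for the response
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();   // Java 9+
        } finally {
            conn.disconnect();
        }
    }
}
```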

Problems like these are very common in Internet systems, and they can stay hidden for a long time, so knowing how to avoid them is very important.

3. Solutions

Let’s look at the changes in CreditEase’s payment system from three aspects.

3.1 Avoid faults as much as possible

3.1.1 Design a fault-tolerant system

Take rerouting as an example. For a payment, users do not care which specific channel their money goes out through; they only care that it succeeds. CreditEase's payment system is connected to more than 30 channels, and a payment may well fail on channel A. In that case the system dynamically reroutes it to channel B or C, so that the user is spared the failure and the payment is made fault tolerant.
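A minimal sketch of the rerouting idea: try channels in priority order and fall back to the next one only on a definite failure. The interface and names are hypothetical; as Q6 in the Q&A explains, ambiguous results such as timeouts must not be rerouted blindly:

```java
import java.util.List;

interface PayChannel {
    /** Returns true on definite success, false on definite failure; throws on an ambiguous result. */
    boolean pay(String orderId, long amountInCents) throws Exception;
}

class ReroutingPayer {
    private final List<PayChannel> channels;   // e.g. channels A, B, C in priority order

    ReroutingPayer(List<PayChannel> channels) { this.channels = channels; }

    boolean payWithFailover(String orderId, long amountInCents) throws Exception {
        for (PayChannel channel : channels) {
            if (channel.pay(orderId, amountInCents)) {
                return true;                   // paid successfully on this channel
            }
            // definite failure: safe to reroute the order to the next channel
        }
        return false;                          // every channel refused the payment
    }
}
```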

There is also fault tolerance for OOM, as in Tomcat: if you reserve some memory for the application itself in advance, you can still catch and handle the error when an OOM occurs.
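One common way to implement this kind of OOM fault tolerance is to pre-allocate a small reserve buffer and release it when an OutOfMemoryError is caught, leaving just enough headroom to log, alarm and shut down in a controlled way. This is an illustrative sketch of the general technique, not the exact Tomcat or CreditEase implementation:

```java
public class OomGuard {
    // reserved up front; only released when an OOM is actually caught
    private static volatile byte[] memoryReserve = new byte[8 * 1024 * 1024]; // ~8 MB

    public static void runGuarded(Runnable task) {
        try {
            task.run();
        } catch (OutOfMemoryError oom) {
            memoryReserve = null;   // free the reserve so there is room to react
            // log the error, raise an alarm, reject new work or trigger a controlled restart here
            throw oom;              // rethrow after cleanup so the failure stays visible
        }
    }
}
```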

3.1.2 The “fail fast” principle for certain steps

The fail-fast principle means that when a problem occurs at any step of the main flow, that flow should be ended quickly and cleanly, rather than waiting until the problem has a negative impact.

Here are a few examples:

(1) When the payment system starts, it needs to load queue information and configuration into the cache. If the loading fails or the queue configuration is wrong, every subsequent request will fail. The best way to handle this is to exit the JVM immediately when loading fails, instead of starting up in an unusable state (see the sketch after this list);

(2) The maximum response time for real-time transaction processing in the payment system is 40 s. If that is exceeded, the pre-processing system stops waiting, releases the thread, and informs the merchant that the transaction is still being processed; the final result is delivered later by notification or fetched by the business line's active query;

(3) CreditEase's payment system uses Redis as a cache database for functions such as real-time alarm instrumentation and duplicate checking. If a Redis operation takes more than 50 ms, it is simply abandoned. In the worst case the impact of such an operation on a payment is 50 ms, which is within the range the system allows.
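A minimal sketch of the startup fail-fast in example (1): if the queue and configuration data cannot be loaded into the cache, exit the JVM instead of starting in a half-usable state. The method names are placeholders for illustration:

```java
public class PaymentBootstrap {
    public static void main(String[] args) {
        try {
            loadQueueConfigIntoCache();    // hypothetical: loads queue and routing configuration
        } catch (Exception e) {
            // fail fast: refusing to start is better than serving requests with broken config
            System.err.println("FATAL: failed to load queue configuration, aborting startup");
            e.printStackTrace();
            System.exit(1);
        }
        startServer();                      // hypothetical: only reached with a valid configuration
    }

    private static void loadQueueConfigIntoCache() throws Exception { /* ... */ }

    private static void startServer() { /* ... */ }
}
```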

3.1.3 Design a system with self-protection capability

Systems generally have third-party dependencies, such as databases and third-party interfaces. During system design, be suspicious of every third party, so that a problem on their side does not set off a chain reaction and bring the system down.

(1) Split the message queues

CreditEase's payment system offers merchants many payment interfaces; common ones include quick payment, personal e-banking, corporate e-banking, refund, cancellation, batch payment, batch withholding, single payment, single withholding, voice payment, balance query, ID authentication, bank card authentication, card-and-password authentication, and so on. There are more than 30 corresponding payment channels, such as WeChat Pay, Apple Pay and Alipay, and hundreds of connected merchants. Across these dimensions, the way CreditEase's payment system ensures that different businesses, third parties, merchants, and payment types do not affect one another is to split the message queues. The following figure shows the split of some of the business message queues:
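A minimal sketch of what the split can look like with the RabbitMQ Java client: declare one durable queue per business dimension, so a backlog in one queue cannot block the others. The queue names and host are illustrative assumptions:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class QueueSplitSetup {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("mq.example.internal");   // illustrative host
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // one queue per business / third-party / payment-type combination
            String[] queues = {
                "pay.quick.wechat", "pay.quick.alipay",
                "pay.refund", "pay.batch.withhold", "pay.auth.bankcard"
            };
            for (String queue : queues) {
                channel.queueDeclare(queue, true, false, false, null); // durable, shared, not auto-delete
            }
        }
    }
}
```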

(2) Limit the use of resources

Limiting resource usage is the most important point in designing a highly available system, and also one that is easy to overlook. Resources are finite by design, and using them excessively will naturally bring the application down. To this end, CreditEase's payment system has done the following homework:

  • Limit the number of connections

With distributed horizontal scaling, the number of database connections has to be planned rather than maximized without limit. The database can only accept so many connections, so the budget must be considered globally across all modules, especially as horizontal scaling adds nodes (a pool-sizing sketch follows).
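One concrete way to budget connections is to cap the pool size per module and per node so that modules × nodes × pool size stays below the database's limit. A sketch using HikariCP; the JDBC URL and all numbers are illustrative assumptions:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class BoundedDbPool {
    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://db.example.internal:3306/pay"); // illustrative
        config.setUsername("pay_app");
        config.setPassword(System.getenv("DB_PASSWORD"));
        // budget example: 10 modules x 2 nodes x 10 connections = 200, below the DB's max_connections
        config.setMaximumPoolSize(10);
        config.setMinimumIdle(2);
        config.setConnectionTimeout(3_000);  // fail fast instead of queueing forever for a connection
        return new HikariDataSource(config);
    }
}
```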

  • Limit memory usage

Excessive memory usage leads to frequent GC and eventually to OOM. Excessive memory usage mainly comes from the following two sources:

A: Collections whose capacity grows too large;

B: Objects that are no longer needed but are never released, such as objects placed in a ThreadLocal, which are held until the thread exits.
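For case B, the usual fix is to clear the ThreadLocal in a finally block: pooled worker threads never exit, so a value that is not removed stays referenced indefinitely. A generic sketch:

```java
public class RequestContext {
    private static final ThreadLocal<String> CURRENT_ORDER = new ThreadLocal<>();

    public static void handle(String orderId, Runnable work) {
        CURRENT_ORDER.set(orderId);
        try {
            work.run();                  // code on this thread can read the current order id
        } finally {
            CURRENT_ORDER.remove();      // mandatory: pooled threads never exit on their own
        }
    }

    public static String currentOrderId() {
        return CURRENT_ORDER.get();
    }
}
```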

  • Limit thread creation

Unrestricted thread creation ultimately makes threads uncontrollable, especially threads created in hidden corners of the code.

When the system's SY (system CPU) value is too high, Linux is spending too much time on thread switching. In Java the main cause is that a large number of threads have been created and keep blocking (lock waits, IO waits) and changing execution state, producing a large number of context switches.

In addition, Java applications allocate memory outside the JVM heap when creating threads, so too many threads also consume too much physical memory.

For thread creation, it is best to use a thread pool, to avoid the context switching caused by too many threads, as sketched below.
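A sketch of such a pool: a fixed number of threads, a bounded queue and an explicit rejection policy, so neither the thread count nor the queue can grow without limit. The sizes are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedWorkers {
    public static ThreadPoolExecutor newBoundedPool() {
        return new ThreadPoolExecutor(
                16, 16,                                    // fixed worker thread count
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1_000),           // bounded queue: back-pressure instead of OOM
                new ThreadPoolExecutor.CallerRunsPolicy()); // when saturated, slow the caller down
    }
}
```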

  • Limit concurrency

Anyone who has built a payment system knows that some third-party payment companies impose concurrency limits on merchants. The number of concurrent requests a third party allows is based on actual transaction volume, so if concurrency is not controlled and every transaction is fired straight at the third party, the only reply will be “please reduce the submission frequency”.

Therefore, both in the system design phase and in code review, special attention must be paid to limiting concurrency to within the range the third party allows; a minimal sketch follows.
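A minimal sketch of per-channel concurrency limiting with a semaphore, so no more requests are in flight to a third party than it permits. The limit of 20 and the 200 ms wait are illustrative assumptions:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ChannelThrottle {
    private final Semaphore permits;

    public ChannelThrottle(int maxConcurrent) {            // e.g. 20, agreed with the third party
        this.permits = new Semaphore(maxConcurrent);
    }

    public <T> T call(Callable<T> request) throws Exception {
        // fail fast if no permit frees up quickly, instead of letting requests pile up
        if (!permits.tryAcquire(200, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("third-party concurrency limit reached, try again later");
        }
        try {
            return request.call();
        } finally {
            permits.release();
        }
    }
}
```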

That covers the first of the three changes CreditEase's payment system made to achieve availability: avoiding failures as much as possible. Now for the other two.

3.2 Detect faults in time

Faults, like the devil entering the village, arrive unannounced. When the first line of defense, prevention, is broken, the second line, detecting the fault in time, has to go up to keep the system available; this is where the alarm and monitoring system comes in. A car without a dashboard gives no idea of speed or fuel level, or whether the turn signals are on; even an experienced driver would be in danger. In the same way, systems need monitoring, and ideally they should alert in advance, so that a failure can be resolved before it becomes a real risk.

3.2.1 Real-time alarm system

Without real-time alarms, uncertainty about the system's operating state can lead to unquantifiable disasters. The monitoring indicators of CreditEase's payment system are as follows:

  • Real-time — second-level monitoring;

  • Comprehensive — covering all system services, with no blind spots;

  • Practical — warnings are divided into several levels, so that monitoring staff can conveniently make accurate decisions according to the severity of each warning;

  • Diverse — warnings are delivered in both push and pull modes, including SMS, email and a visual interface, so that monitoring staff can spot problems in time.

Alarms are divided into single-machine alarms and cluster alarms; CreditEase's payment system is deployed as a cluster. Real-time warning is achieved mainly by statistically analyzing the real-time instrumentation data produced by each business system, so the difficulty lies mainly in the data instrumentation and the analysis system.

3.2.2 Instrumentation (burial point) data

To achieve real-time analysis without affecting the transaction system's response time, CreditEase's payment system writes real-time instrumentation data into Redis in each module, then aggregates the data into the analysis system, which analyzes it against rules and raises alarms.
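A sketch of the instrumentation idea using the Jedis client: each module increments a counter keyed by metric name and minute, and the analysis system later reads these counters and applies its rules. The key layout and host are illustrative assumptions, and instrumentation failures are swallowed so they can never affect a payment:

```java
import redis.clients.jedis.Jedis;

public class MetricPoint {
    private final Jedis jedis = new Jedis("redis.example.internal", 6379); // illustrative host

    /** Records one occurrence of a metric, e.g. "net_exception:channelA", bucketed by minute. */
    public void record(String metric) {
        String key = "metric:" + metric + ":" + (System.currentTimeMillis() / 60_000);
        try {
            jedis.incr(key);             // counter for the analysis system to read
            jedis.expire(key, 3600);     // keep each bucket for an hour, then let it expire
        } catch (Exception e) {
            // instrumentation must never break the payment flow: drop the data point and move on
        }
    }
}
```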

3.2.3 The analysis system

The hardest part of the analysis system is defining the business alarm points: which alarms must be acted on the moment they appear, and which only need to be watched. Here is a more detailed introduction to the analysis system:

(1) System operation architecture

(2) System operation process

(3) System business monitoring point

CreditEase's payment system's business monitoring points have been accumulated bit by bit from daily operations and fall into two groups: the alarm class and the attention class.

A: Alarm class

  • Network anomaly warning;

  • Warning for a single order that times out without completing;

  • Real-time transaction success rate warning;

  • Abnormal state warning;

  • No return file warning;

  • Failure notification warning;

  • Abnormal failure warning;

  • Frequently occurring response code warning;

  • Reconciliation inconsistency warning;

  • Special state warning;

B: Attention class

  • Abnormal transaction warning;

  • Warning when transaction volume exceeds 5,000,000 (500W);

  • SMS backfill timeout warning;

  • Illegal IP warning;

3.2.4 Non-business monitoring points

Non-business monitoring points mainly refer to monitoring from the operations perspective, covering the network, hosts, storage and logs. The details are as follows:

(1) Service availability monitoring

The JVM is used to collect the number and duration of Young GC / Full GC runs, heap memory usage, and the stacks of the top 10 threads, including the length of cache buffers.

(2) Traffic monitoring

Monitoring agents are deployed on each server to collect traffic in real time.

(3) External system monitoring

Interval probing is used to observe the stability of third parties and of the network.

(4) Middleware monitoring

  • For MQ consumption queues, queue depth is analyzed in real time through a RabbitMQ script probe;

  • For databases, the XDB plug-in is installed to monitor database performance in real time.

(5) Real-time log monitoring

Rsyslog is used to collect the distributed logs, which the analysis system then monitors and analyzes in real time; finally, visual pages present the results to users.

(6) System resource monitoring

Zabbix monitors host CPU load, memory usage, upstream and downstream traffic on each NIC, disk read/write rates, disk read/write counts (IOPS), and disk space usage.

The above is what the real-time monitoring system of CreditEase's payment system does, divided into two main areas: business-point monitoring and operations monitoring. Although the system is deployed in a distributed fashion, every warning point responds at the second level. There is a further difficulty with business alarm points: for some alarms a small number of occurrences is harmless, but a large number indicates a real problem, the so-called quantitative change leading to qualitative change.

Take network exceptions as an example. A single occurrence may just be network jitter, but several occurrences mean the network really needs attention. For network exceptions, the alarm samples of CreditEase's payment system are as follows (a threshold-checking sketch follows the list):

  • Single-channel network exception warning: within 1 minute, 12 consecutive network exceptions occur on channel A, triggering the alarm threshold;

  • Multi-channel network exception warning 1: within 10 minutes, 3 network exceptions occur every minute, involving 3 channels, triggering the alarm threshold;

  • Multi-channel network exception warning 2: within 10 minutes, 25 network exceptions occur in total, involving 3 channels, triggering the alarm threshold.
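These rules amount to counting exceptions in a sliding window and comparing the count with a threshold. A minimal in-memory sketch of the single-channel rule (12 exceptions on one channel within 1 minute); the real system evaluates such rules in the analysis system against the Redis instrumentation data:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SingleChannelNetworkRule {
    private static final long WINDOW_MS = 60_000;   // 1 minute
    private static final int THRESHOLD = 12;        // 12 network exceptions trigger the alarm

    private final Deque<Long> timestamps = new ArrayDeque<>();

    /** Call once per network exception on this channel; returns true when the alarm should fire. */
    public synchronized boolean onNetworkException(long nowMillis) {
        timestamps.addLast(nowMillis);
        while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() > WINDOW_MS) {
            timestamps.pollFirst();                  // drop events that fell out of the window
        }
        return timestamps.size() >= THRESHOLD;
    }
}
```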

3.2.5 Log Recording and Analysis System

For a large system, recording and analyzing the sheer volume of daily logs is itself difficult. CreditEase's payment system handles an average of 2,000,000 (200W) orders per day, and one transaction passes through more than a dozen modules. Assuming an order produces 30 log entries, you can imagine how huge the daily log volume is.

Log analysis in CreditEase's payment system serves two purposes: real-time log anomaly warning, and providing order trajectories for operations staff.

(1) Real-time log warning

Real-time log warning scans all real-time transaction logs, captures in real time any line containing the keyword Exception or Error, and then raises an alarm. The advantage is that any runtime exception in the code is discovered as soon as possible. CreditEase's payment system does this by first collecting the logs with Rsyslog, then capturing them in real time in the analysis system, and then warning in real time.
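A sketch of the keyword capture: follow the collected log stream and raise an alarm for any line containing Exception or Error. The log path and the alarm hook are illustrative placeholders:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Pattern;

public class LogKeywordWatcher {
    private static final Pattern ALERT = Pattern.compile("Exception|Error");

    public static void main(String[] args) throws IOException {
        // illustrative: tail the rsyslog-collected file; any log-following mechanism works
        Process tail = new ProcessBuilder("tail", "-F", "/var/log/pay/realtime.log").start();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(tail.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (ALERT.matcher(line).find()) {
                    raiseAlarm(line);                // hypothetical hook into the warning system
                }
            }
        }
    }

    private static void raiseAlarm(String line) {
        System.err.println("[ALARM] " + line);
    }
}
```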

(2) Order track

For a trading system, it is necessary to know the status flow of an order in real time. CreditEase's payment system initially recorded order trajectories in the database, but after running for a while the sharp increase in orders made the table too large to maintain.

Now, each module of CreditEase's payment system prints a trajectory log instead. The trajectory log is printed in a format that mirrors the structure of the database table; once printed, Rsyslog collects the logs, the log analysis system captures the structured lines in real time, parses them, stores them in the database partitioned by date, and presents a visual interface to operations staff.

Log printing specifications are as follows:

(Sample trajectory log: a single pipe-delimited (“||”) line containing the timestamp, thread name, module name, processing stage, merchant and order numbers, status and response codes, amounts, channel information, IP addresses, the notification URL, and other, mostly masked, fields.)

A visual trace of the log, in brief, looks as follows:

In addition to the above, the logging and analysis system also allows transaction and response messages to be downloaded and viewed.

3.2.6 7×24-hour monitoring room

For the alarm items above, CreditEase's payment system gives operators both push and pull: push via SMS and email, and pull via report displays. In addition, because a payment system is more critical than most other Internet systems, CreditEase uses a 7×24-hour monitoring room to ensure the security and stability of the system.

3.3 Handle faults in time

After a failure occurs, especially in production, the first priority is not to find the root cause but to deal with the failure as quickly as possible and keep the system available. The common faults of CreditEase's payment system and the corresponding measures are as follows:

3.3.1 Automatic Repair

As for automatic repair, the most common faults in CreditEase's payment system are caused by third-party instability. In these cases the system automatically reroutes, as described above.

3.3.2 Service degradation

Service degradation means disabling certain functions, when a failure cannot be fixed quickly, so that the core functions remain available. When merchants on CreditEase's payment system run promotions, if one merchant's transaction volume grows too large, the system adjusts that merchant's traffic in real time and degrades its service so that other merchants are not affected. There are many similar scenarios; the specific service degradation features will be introduced later in this series.

4. Q&A

Q1: Can you give us some details about how RabbitMQ went down?

A1: The RabbitMQ downtime made us question the availability of the whole system. At the time it was not RabbitMQ itself that went down (RabbitMQ is still stable); it was the hardware under it. The real problem was that RabbitMQ was deployed on a single node. Everyone assumed RabbitMQ would not go down and therefore never questioned the single point. The lesson for us is that no service may be a single point, including application servers, middleware, network equipment, and so on. Removing single points is not only about the component itself: for example, the whole service should be duplicated and A/B tested, and of course there are also dual data centers.

Q2: Are development and operations combined at your company?

A2: Development and operations are separate. Today's sharing is mainly about the availability of the whole system, mostly from the development side with some operations. This is the road CreditEase's payment system has traveled, and I witnessed all of it.

Q3: Is your backend all Java? Any other languages considered?

A3: Most of our current systems are Java, with a few in Python, PHP and C++, depending on the business type. Java is the best fit for us at this stage; other languages may be considered as the business expands.

Q4: You said to be suspicious of third-party dependencies. Can you give a specific example of how to do that? What if the third party does not work at all?

A4: Systems generally have third-party dependencies, such as databases and third-party interfaces. During system design you must be suspicious of third parties, to avoid the chain reaction, and ultimately downtime, when a third party has problems. Everyone knows that when something goes wrong in a system it snowballs. Take our QR-code channels as an example: if we had only one code-scanning channel, we would have no options when that channel failed. So we doubt it from the start and connect several channels; if an exception occurs, the real-time monitoring system triggers an alarm and the routing automatically switches channels, keeping the service available. Second, asynchronous messages are split by payment type, merchant and transaction type, so that if one type of transaction hits an unpredictable exception, it does not affect the other channels; it is like a multi-lane highway where fast and slow lanes do not interfere with each other. The general idea is fault tolerance + splitting + isolation, with each specific problem handled on its own terms.

Q5: After a payment times out, there may be network problems. Could the money be paid but the order lost? How do you handle disaster recovery and data consistency?

A5: The most important thing in payments is security, so we adopt a conservative strategy for order status. Orders with network exceptions are set to a “processing” status, and final consistency with the bank or third party is reached later through active queries or passive notifications. In a payment system, the bank or third party reports the result through response codes, and the translation from response code to order status must be conservative, so that there can be no overpayment or underpayment of funds. In short, the principle is funds safety first, and every strategy follows the whitelist principle.

Q6: You mentioned that if one payment channel times out, the routing policy distributes traffic to another channel. According to the channel diagram these are different payment methods, such as Alipay or WeChat Pay. If I only want to pay through WeChat, why not retry instead of switching to another channel? Or does “channel” here mean the request node?

A6: First, rerouting cannot be done on a timeout, because with a socket timeout we cannot tell whether the transaction reached the third party, or whether it succeeded or failed; if it actually succeeded, retrying would cause an overpayment, and that kind of fund loss is unacceptable for the company. Second, routing depends on the business type: for a single disbursement transaction the user does not care which channel the money goes out through, so it can be rerouted. For a code-scanning payment, if the user scanned with WeChat the payment ultimately goes through WeChat, but we have many intermediary channels through which WeChat is reached, and we can route among those intermediary channels; for the user it is still WeChat Pay in the end.

Q7: Can you give an example of the automatic repair process? How do you detect instability and decide to reroute?

A7: Automatic repair is fault tolerance via rerouting, and this is a very good question: if instability is detected, rerouting is decided. Before rerouting, it must be certain that the transaction being rerouted did not succeed, otherwise it would lead to overpayment problems. At present our system reroutes in two ways, after the fact and during the transaction. After the fact: for example, if the real-time warning system finds within 5 minutes that a channel is unstable, transactions after that point are routed to other channels. During the transaction: the failure response code returned for each order is analyzed; the response codes are combed through and it is made explicit which ones permit rerouting. These are just two of the points; there are many more business-specific ones that I will not cover here for reasons of space. The general idea is that you must have an in-memory real-time analysis system that makes decisions at the second level; the system must be fast, and it must be combined with decision support. Our real-time warning system does both real-time and offline analysis, with real-time analysis at the second level.

Q8: Is there a regular pattern to merchant promotions? How big are sales spikes compared to normal times? Do you run technical drills? What are the priorities for degradation?

A8: For merchant promotions, we usually contact the merchant in advance to learn when the promotion will run, and prepare targeted measures ahead of time. The gap between the promotion peak and normal times is very large; promotions generally last no more than 2 hours, and some, such as sales of financial products, are concentrated in 1 hour, so the peak is very high. Technical drills mean that once we understand a merchant's promotion, we estimate the processing capacity of the system and rehearse in advance. Degradation priority is mainly by merchant: since merchants access many payment scenarios, such as wealth management, collection and payment on behalf, quick payment, code scanning and so on, our overall principle is that merchants must not affect each other, i.e., one merchant's promotion cannot affect other merchants.

Q9: How do you store the logs collected by rsyslog?

A9: Good question. At first we recorded the order trajectory log in a database table, but an order flows through many modules, each order produces around 10 trajectory records, and with 4,000,000 (400W) transactions a day the table quickly became a problem; even sharding affected database performance, and this is an auxiliary function that should not burden the database. Then we realized that writing logs is cheaper than writing the database, so each module prints the real-time trajectory log in tabular form to a fixed directory on the log server's disk (the volume is manageable because only real-time logs are kept there). Because the logs are spread across machines, they are gathered into one place via mounted storage, and a dedicated operations team wrote a program that parses these table-formatted logs in real time and displays them on a visual page, so operations staff see order trajectories almost in real time. Storage is actually not a problem, because we keep real-time logs and offline logs, and offline logs past a certain age are rotated and eventually deleted.

Q10: How do system monitoring and performance monitoring work together?

A10: In my understanding, system monitoring includes system performance monitoring; performance monitoring is part of overall system monitoring, so there is no coordination problem. System performance monitoring has multiple dimensions, such as the application layer, middleware and the container. For the system's non-business monitoring, see the relevant part of this article.

Author: Feng Zhongqi

Source: CreditEase Institute of Technology