1. Introduction

According to the CAP theorem, a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance; once a network partition occurs, it must sacrifice either consistency or availability. Wherever there is a network call there is a possibility of failure, so longer or shorter windows of inconsistent state between systems are inevitable. As service-oriented architectures become widespread, how to detect inconsistent state between services in time, and how to quantify a system's data consistency, have become problems that every developer in a distributed environment must consider and solve.

2. Background

Taking the transaction link as an example, some potential inconsistency scenarios are:

  • The order payment was successful, but the order status is still “pending payment”.
  • The logistics show the package as delivered, but the order is still “to be delivered”.
  • The bank refund has been received, but the order is still “refund in progress”.
  • More than 7 days have passed since the order was shipped, but it has not been completed automatically.

Each of the above business scenarios can generate user feedback and cause users trouble. The core purpose of the business reconciliation platform is to discover and fix such problems in time, so that they are addressed before users report them.

3. Challenges

So what challenges will a business reconciliation platform face?

Our core demands for a business reconciliation platform are: letting business systems connect quickly, processing the business parties' massive data, and guaranteeing a degree of real-time performance. These demands profoundly shape the platform's system design.

4. Architecture

Moving from the parts to the whole, this section first examines the local designs that address each of the three demands above, and then looks at the overall system architecture.

4.1 Easy Access

We believe every reconciliation process can be divided into four steps: “data loading”, “conversion and parsing”, “comparison”, and “result handling”. To accommodate diverse business scenarios, these steps must be orchestratable, so that differentiated execution components can be placed at each one; at every process node, rules should freely select which component to embed. In addition, data must be converted from its original format into a standard accounting format, on top of which a universal comparator can be built. In summary, we believe the reconciliation engine needs the following capabilities:

  • Process orchestration
  • Rule-driven component selection
  • Plug-in access

The structure of the reconciliation engine in the current business reconciliation platform is as follows:

Among them, ResourceLoader, Parser, Checker, and ResultHandler are all standard interfaces. Any Spring bean implementing one of these interfaces can be orchestrated into the reconciliation process, including plug-ins implemented by the business side; this is what makes the engine pluggable and orchestratable. The responsibilities of each process node are:

  • ResourceLoader: Provides loader factories for various data sources (DB, FILE, RPC, REST, etc.) to load each source's raw data. Supported loading modes include driver loading, parallel loading, and multi-party loading. The business side can also implement its own loader and embed it in the reconciliation process through process orchestration.
  • Parser: Models the loaded raw data and transforms it into the standard accounting model. Scripted (Groovy) transformations are provided via the rules engine.
  • Checker: Compares the specified fields according to the configured rules and generates the reconciliation result. Comparison strategies such as findFirst (stop at the first inconsistency) and full (find all inconsistencies) are supported.
  • ResultHandler: Processes the results with the specified handler. Common handlers include persistence, sending alert emails, and even repairing data directly.
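The four standard interfaces above can be sketched as plain Java interfaces. The method signatures below, and the minimal Checker illustrating the findFirst versus full comparison strategies, are illustrative assumptions, not the platform's real API:

```java
import java.util.*;

// Hypothetical shapes for the four standard process-node interfaces.
interface ResourceLoader { List<Map<String, String>> load(Map<String, String> config); }
interface Parser        { Map<String, String> parse(Map<String, String> raw); }
interface Checker       { List<String> check(Map<String, String> left, Map<String, String> right); }
interface ResultHandler { void handle(List<String> diffs); }

// Minimal field-by-field Checker showing the two comparison strategies.
class FieldChecker implements Checker {
    private final List<String> fields; // fields to compare, from task configuration
    private final boolean findFirst;   // true: stop at the first inconsistency

    FieldChecker(List<String> fields, boolean findFirst) {
        this.fields = fields;
        this.findFirst = findFirst;
    }

    @Override
    public List<String> check(Map<String, String> left, Map<String, String> right) {
        List<String> diffs = new ArrayList<>();
        for (String f : fields) {
            if (!Objects.equals(left.get(f), right.get(f))) {
                diffs.add(f + ": " + left.get(f) + " != " + right.get(f));
                if (findFirst) break; // findFirst strategy reports only the first diff
            }
        }
        return diffs; // empty list means the two records are consistent
    }
}
```

Because any Spring bean implementing one of these interfaces can be picked up, a business-side plug-in only needs to implement the interface and register itself as a bean.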

A unified facade chains the whole reconciliation process together; at each node, the engine selects the default component or a plug-in according to the configuration. Every process node can be orchestrated in the administrative console:
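A minimal facade in this spirit might look as follows. The in-memory registry stands in for Spring's bean lookup, and all names and signatures here are illustrative assumptions:

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Runs the configured steps in series, looking each component up by name.
class ReconcileFacade {
    private final Map<String, UnaryOperator<List<String>>> components = new HashMap<>();

    void register(String beanName, UnaryOperator<List<String>> step) {
        components.put(beanName, step);
    }

    // Executes the configured chain, e.g. load -> parse -> check -> handle.
    List<String> run(List<String> input, String... stepNames) {
        List<String> data = input;
        for (String name : stepNames) {
            // Per-node component selection is driven purely by configuration,
            // so default components and plug-ins are interchangeable.
            data = components.get(name).apply(data);
        }
        return data;
    }
}
```

Swapping a default component for a business-side plug-in then amounts to changing one name in the task configuration.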

4.2 High Throughput

In some offline scheduled reconciliation scenarios, the data volume of a single reconciliation may reach millions or even tens of millions of records. This poses a throughput challenge for the platform. Our usual answer to massive data is to split it: distributed task splitting plus task splitting within a single machine, so that each data block becomes small. Big-data tools can also share the load.

There are currently two reconciliation modes. The first is a conventional mode: the data platform pushes the data to be reconciled (containing all raw primary keys, such as order numbers) into the reconciliation center's DB, and the reconciliation cluster then loads it for comparison using a sharding strategy with paginated batch loads. When the data volume exceeds ten million records, the data platform's Spark engine reads the data from the Hive table and sends it to NSQ (a self-developed message queue). NSQ delivers each message to exactly one consumer (not to multiple consumers), so tens of millions of records become messages executed by the distributed reconciliation servers.
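The shard-then-paginate idea in the conventional mode can be sketched as below. The hash-based shard assignment and fixed page size are assumptions for illustration; the real platform's sharding strategy is not specified:

```java
import java.util.*;
import java.util.stream.*;

class ShardedPager {
    // Each reconciliation server takes only the keys whose hash maps to its shard.
    static List<String> keysForShard(List<String> allKeys, int shard, int totalShards) {
        return allKeys.stream()
                .filter(k -> Math.floorMod(k.hashCode(), totalShards) == shard)
                .collect(Collectors.toList());
    }

    // Within one server, keys are loaded in fixed-size pages so that no single
    // query ever touches the full data set.
    static List<List<String>> pages(List<String> keys, int pageSize) {
        List<List<String>> pages = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += pageSize) {
            pages.add(keys.subList(i, Math.min(i + pageSize, keys.size())));
        }
        return pages;
    }
}
```

Together, the two levels of splitting turn one ten-million-row task into many small, independently executable units.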

Reconciliation tasks usually run in the early hours of the morning when traffic is low, because during reconciliation real-time data is fetched by back-querying business service interfaces. Better still, reconciliation can avoid back-querying business interfaces entirely; this requires business data to be synchronized quasi-real-time into the reconciliation center's DB cluster in advance.

The main idea is to synchronize data based on the business DB's binlog or the business system's messages; the subsequent process is similar.

4.3 High Real-Time Performance

Consider a specific business scenario: the buyer has paid successfully, but the order status is still “pending payment” because the bank's third-party payment-status callback is delayed. In such cases the buyer grows anxious and may complain. Scenarios like this require real-time reconciliation, also called second-level reconciliation.

Second-level reconciliation is usually triggered by business messages, and the reconciliation task must complete within a short time after the event fires. Because these event messages often arrive with high concurrency, the architecture must be designed to absorb them.

In the design, an EventPool is added to buffer high-concurrency event messages, together with a pipeline for rate limiting, sampling, and routing. Before reaching the event-processing thread pool, events pass through a blocking queue so that bursts of requests cannot exhaust thread resources; this also makes event processing asynchronous. Worker threads take tasks from the blocking queue in batches. A delayed blocking queue can additionally implement delayed reconciliation: if we compare immediately after an event, the state is often still propagating and looks inconsistent, so the comparison should only be triggered after the event has aged for a period of time.

4.4 Overall Design

The sections above introduced the design of each part of the business reconciliation platform; the following describes the overall structure.

Overall, the platform adopts a layered architecture: scheduling layer + reconciliation engine (Core + Plugin) + infrastructure. The scheduling layer is responsible for task triggering, splitting, and scheduling; the reconciliation engine executes the orchestrated reconciliation process; and the infrastructure layer provides basic capabilities such as the rules engine, process engine, generic invocation, and monitoring.

5. Health Score

The reconciliation center holds data-consistency information for each business system and its entire link. On this basis, the platform can report on the health of business systems and links.

6. Co-construction

As mentioned earlier, the reconciliation process is split into four fixed process nodes with four corresponding standard interfaces. The process engine and rules engine can orchestrate either the system's default components or plug-ins into the reconciliation process, looked up by Spring bean name. This open design lets the business reconciliation platform be co-built with business teams.

First, the reconciliation platform publishes an API JAR containing the standard interfaces; the business side imports the JAR, implements the relevant interfaces, and packages its implementation. The platform then imports the business side's plug-in package via SPI and loads it into the reconciliation center's JVM for execution.
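The SPI discovery step can be sketched with `java.util.ServiceLoader`: the business side lists its implementation classes in `META-INF/services/<interface name>` inside the plug-in JAR, and the platform discovers them at startup. `CheckerPlugin` is a hypothetical stand-in for any of the four standard interfaces:

```java
import java.util.ServiceLoader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical plug-in contract shipped in the platform's API JAR.
interface CheckerPlugin {
    String name();
}

class PluginRegistry {
    // Discovers every CheckerPlugin registered via META-INF/services on the
    // classpath and returns their names. With no plug-in JARs present, the
    // result is simply empty.
    static List<String> discover() {
        List<String> names = new ArrayList<>();
        for (CheckerPlugin p : ServiceLoader.load(CheckerPlugin.class)) {
            names.add(p.name());
        }
        return names;
    }
}
```

Each discovered implementation can then be registered under its bean name and selected through the same process-orchestration configuration as the default components.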

7. Future Work

The business reconciliation platform was built for business scenarios, but it is also a data-intensive application. Since launch, it has onboarded dozens of reconciliation tasks from teams across the company and processes tens of millions of records every day. Looking ahead, its mission will evolve from offline data analysis and processing toward using application health data to help systems adjust in real time. No one in a distributed environment can avoid data consistency, and we approach it with respect. Please contact [email protected] to exchange ideas.