Background

In recent years, the online cash-loan business has grown rapidly, and both emerging Internet companies and traditional financial institutions are racing to capture the market and win customers in this field. Online lending, however, is clearly different from other Internet businesses: because it is at heart a financial product, risk management is essential. Risk affects not only the profitability of the product but also whether the product succeeds at all.

Moving the business online means that customer information cannot be verified face to face; a customer's identity and repayment ability can only be checked through limited channels, which greatly increases the cost of risk. At the same time, if the application process is too cumbersome, the user experience suffers, which hurts both promotion and adoption. The challenge of risk control for online lending is therefore to make a clear risk judgment in the shortest possible time with limited data.

Application

A typical approach when building a risk strategy is to feed a variety of risk variables and their derived variables into an expert scoring model. In practice, we found that in addition to the widely used consumption-behavior data and basic income data, social relationships between users along certain dimensions are also effective model variables.

The most immediate problem with these variables is data volume. If we treat the phone numbers in a user's address book as a form of relationship, and assume each user has on average 100 contacts, then 1 million registered users already produce about 100 million contact records. In fact, within about a year of launch, our social-relationship tables had grown to roughly 5 billion rows.

Compared with storage, variable derivation and query matching are the harder problems. A person's social relationships naturally form a graph, and many rules in the expert models need to match a user's relationships at three or more hops. The simplest example is counting how many users hit the system blacklist within three hops along the contact relationship. Again assuming an average of 100 contacts per user, a three-hop expansion touches roughly 100 × 100 × 100 = 1 million contacts. Many rules involve a similar amount of computation, the business scenarios that invoke them are called frequently, and the response-time requirements are strict.
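As an illustration only, and not our production implementation, the three-hop blacklist rule can be sketched as a breadth-first expansion over an in-memory adjacency map. The function and variable names below are hypothetical.

```python
from collections import deque

def blacklist_hits_within_hops(contacts, blacklist, start_user, max_hops=3):
    """Count distinct blacklisted users reachable from start_user
    within max_hops along the contact relationship.

    contacts  -- dict mapping a user id to the set of ids in their address book
    blacklist -- set of blacklisted user ids
    """
    visited = {start_user}
    queue = deque([(start_user, 0)])
    hits = set()

    while queue:
        user, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in contacts.get(user, ()):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            if neighbor in blacklist:
                hits.add(neighbor)
            queue.append((neighbor, depth + 1))

    return len(hits)

# Tiny example: u1 -> u2 -> u3 -> u4, where u4 is blacklisted and 3 hops away.
contacts = {"u1": {"u2"}, "u2": {"u3"}, "u3": {"u4"}}
print(blacklist_hits_within_hops(contacts, {"u4"}, "u1"))  # 1
```

With 100 contacts per user this frontier grows a hundredfold per hop, which is why the naive in-memory approach does not carry over directly to billions of edges.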

V1.0 Solution

During the evaluation phase we considered several options, each with its own trade-offs. The first to be ruled out was MySQL. The advantage of a relational database is ease of querying: SQL is a basic skill for both developers and data analysts, so features can be implemented quickly. At the storage and compute level, however, MySQL's performance fell short. Faced with this data volume, the only horizontal-scaling strategy MySQL offers is sharding across databases and tables, which makes the query logic complex and hard to maintain and seriously degrades performance.

Another option was to use HBase for storage. Its advantage is obvious: it scales horizontally, so data volume is no longer the bottleneck. Its disadvantages are equally obvious: it is not developer-friendly, and its query API is limited to fetching a single row by key or reading in batches through the Scan API. More importantly, HBase has no native support for graph data structures; a graph can only be simulated with tall tables and redundantly stored data.
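Under that tall-table approach, each edge becomes its own row whose key concatenates the two user ids, so one user's contacts come back from a prefix scan. The sketch below uses the happybase client with hypothetical table and column names; it illustrates the modeling idea rather than our actual schema.

```python
import happybase

# Connect to a (hypothetical) HBase Thrift server.
connection = happybase.Connection("hbase-thrift-host")
edges = connection.table("contact_edges")

def add_contact(user_id, contact_id):
    """Store one directed edge as a tall-table row: rowkey = 'user|contact'."""
    row_key = f"{user_id}|{contact_id}".encode()
    edges.put(row_key, {b"cf:exists": b"1"})

def contacts_of(user_id):
    """Fetch all first-hop contacts of a user with a prefix scan."""
    prefix = f"{user_id}|".encode()
    for row_key, _ in edges.scan(row_prefix=prefix):
        yield row_key.decode().split("|", 1)[1]

add_contact("u1", "u2")
print(list(contacts_of("u1")))  # ['u2']
```

Multi-hop rules then require issuing one scan per frontier node and stitching the results together in application code, which is exactly the awkwardness described above.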

The third option was a dedicated graph database. We first looked at the open-source Titan and found that the project had been abandoned; its core team had apparently moved on to build a new commercial graph database and founded a company around it. Titan also uses HBase or Cassandra as its storage engine (either can be chosen as required), and its performance did not meet our requirements. We then examined two open-source commercial products, Neo4j and OrientDB. Both offer free community editions that are naturally less feature-rich than the commercial versions: the community edition of Neo4j does not support HA and can only run on a single machine, while the community edition of OrientDB supports both HA and sharding. Both support the major programming languages. Neo4j provides its own pattern-matching query language, Cypher, while OrientDB provides a SQL-like API; each has its strengths.

The solution that finally went online combined HBase and Neo4j. HBase, on a nine-server cluster (32 GB of memory and 16 cores each), stores the basic information about business objects and their first-degree relationships. Neo4j handles the graph data structure on a single server with 256 GB of memory and a 2 TB SSD. After going live, the TPS of the real-time analysis interfaces was about 300, with the 90th-percentile response time staying around 200 ms. Most tables held between 30 million and 600 million rows, and some core tables held about 3 billion rows.
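On the Neo4j side, a rule like the three-hop blacklist count maps naturally onto a Cypher pattern match. The snippet below, using the official neo4j Python driver, is a sketch with hypothetical labels, relationship types, and property names (User, KNOWS, blacklisted), not the queries we actually run.

```python
from neo4j import GraphDatabase

# Hypothetical connection details and data model (User nodes, KNOWS edges).
driver = GraphDatabase.driver("bolt://neo4j-host:7687",
                              auth=("neo4j", "password"))

QUERY = """
MATCH (u:User {user_id: $user_id})-[:KNOWS*1..3]-(other:User)
WHERE other.blacklisted = true
RETURN count(DISTINCT other) AS hits
"""

def blacklist_hits(user_id):
    """Count distinct blacklisted users within three hops of the given user."""
    with driver.session() as session:
        record = session.run(QUERY, user_id=user_id).single()
        return record["hits"]

print(blacklist_hits("u1"))
```

The query is concise, but on a single community-edition instance the variable-length expansion is bounded by that one machine's memory and CPU, which is the scaling concern discussed in the next section.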

V2.0 Solution: Introducing TiDB

The system was generally stable after going live, but some problems remained. As the store for graph data, the community edition of Neo4j supports only a single node and cannot scale horizontally. Although it met our performance and functional requirements at that stage, we could foresee that it would become a bottleneck in the near future. The drawback of HBase is that its data model is too simple to offer an easy-to-use interface to OLAP systems and analysts; data could only be synchronized to the data warehouse in batches through ETL, so its freshness was poor.

After talking with PingCAP's technical team, we learned about TiDB, a distributed database developed in China. Several of its features appealed to us, and two of them addressed our existing problems directly. The first is its ability to scale horizontally almost without limit while preserving transactional semantics; this removes the tedium of sharding databases and tables and makes it feasible to keep massive data in a relational model. The second is its high compatibility with the MySQL protocol, which gives both developers and data analysts a familiar interface. With these capabilities, we found that the line between online OLTP and OLAP blurred, and in some business scenarios the two could be fully combined.
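Because TiDB speaks the MySQL protocol, a standard MySQL client library is enough to talk to it. The sketch below, with placeholder host names, credentials, and a hypothetical contact_edge table, shows how contact relationships can be kept as an ordinary relational edge table.

```python
import pymysql

# TiDB is MySQL-compatible, so a standard MySQL driver works unchanged.
# Host, credentials, and schema below are placeholders, not our production setup.
conn = pymysql.connect(host="tidb-host", port=4000, user="app",
                       password="secret", database="risk")

DDL = """
CREATE TABLE IF NOT EXISTS contact_edge (
    src_user BIGINT NOT NULL,
    dst_user BIGINT NOT NULL,
    PRIMARY KEY (src_user, dst_user),
    KEY idx_dst (dst_user)
)
"""

with conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute("INSERT IGNORE INTO contact_edge VALUES (%s, %s)", (1, 2))
conn.commit()
```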

Working with PingCAP's engineers, we quickly designed the new V2.0 solution. To keep the migration safe, we first used Kafka as a data bypass and wrote a copy of all the basic data into the TiDB cluster. At the same time, about 7 billion vertices and their associated edge data were moved out of Neo4j and stored in TiDB. We then implemented the graph algorithms the product needs, including shortest path and multi-node connectivity checks, on top of the relational model through the SQL interface. Although this took somewhat more work than using Neo4j, it brought clear gains in performance, especially in throughput. Neo4j's transaction model had been a burden: with frequent vertex updates under high concurrency, transactions could easily hold locks for a long time, which seriously reduced the throughput of the system.
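As one example of expressing graph traversal over the relational model, a k-hop expansion can be implemented as repeated lookups against the edge table, carrying the current frontier of ids in the application. The sketch below reuses the hypothetical contact_edge schema above and is an illustration under those assumptions, not our production code.

```python
import pymysql

conn = pymysql.connect(host="tidb-host", port=4000, user="app",
                       password="secret", database="risk")

def k_hop_neighbors(start_user, k=3, batch=1000):
    """Expand the contact graph level by level with batched IN-list queries."""
    visited = {start_user}
    frontier = [start_user]
    for _ in range(k):
        next_frontier = []
        # Query the edge table one batch of frontier ids at a time.
        for i in range(0, len(frontier), batch):
            chunk = frontier[i:i + batch]
            placeholders = ",".join(["%s"] * len(chunk))
            sql = ("SELECT dst_user FROM contact_edge "
                   f"WHERE src_user IN ({placeholders})")
            with conn.cursor() as cur:
                cur.execute(sql, chunk)
                for (dst,) in cur.fetchall():
                    if dst not in visited:
                        visited.add(dst)
                        next_frontier.append(dst)
        frontier = next_frontier
        if not frontier:
            break
    return visited - {start_user}

print(len(k_hop_neighbors(1, k=3)))
```

Shortest-path and connectivity checks follow the same pattern: expand level by level and stop as soon as the target id appears in the frontier, which also yields the path length.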

For the final deployment we used a nine-server cluster for TiDB: three servers run PD and the TiDB servers, and six run TiKV for storage. It has run stably since then; apart from a few bugs in our own business logic, there have been no incidents. As business volume grew, TPS rose to about 5,000 and the peak computing load on the whole database platform increased roughly tenfold, yet overall throughput and response time showed no unusual jitter and stayed within an acceptable range.

The biggest improvement for risk analysts is that they can now connect directly to the production TiDB cluster with the SQL tools they already know. Data freshness is greatly improved, their efficiency has risen qualitatively, and part of the ETL work has been eliminated.

Next Step

Having gained hands-on knowledge and experience with TiDB, we plan to replace HBase with TiDB for storing the data related to the user risk model. At the same time, we will gradually migrate the remaining Neo4j data into TiDB and finally return to a relational model, no longer on an aging MySQL but on a new generation of distributed database, TiDB.

Author: Jin Zhonghao, Director of the Data Intelligence Department, 360 Finance