Abstract

Uncovering user behavior patterns in complex data relationships is an important part of FinTech risk-control management, as it provides a basis for decisions about online risk-control strategies. With the help of big data technology, we can quickly capture subtle changes in user behavior patterns and discover deep relationships between users from multiple perspectives, providing strong support for risk control. This talk introduces Dianrong's machine learning platform and a case analysis of the data team's work in the risk-control business.

On June 11, 2017, Liu Li, head of Dianrong's data scientist team, gave the talk "Introduction to the Application of Big Data Technology in Dianrong's Business" at the "Latest Scenario-based Big Data Application Practice" joint forum held by Ele.me and Qiniu Cloud. As the exclusive video partner, IT Dakashuo publishes this content with the review and authorization of the organizers and the speaker.

Word count: 1,722 | Reading time: 4 minutes

Video playback of guest speech:
t.cn/RQE3liO

About Dianrong

Dianrong was founded in Shanghai six years ago, with a technology platform from Lending Club and backing from Tiger Global, Standard Chartered Private Equity, and many other well-known venture capital funds. It now has 28 offices in China and more than 2,600 employees, and it launched China's first blockchain platform. To date, users' total investment has exceeded 29 billion yuan, and the interest returned to users has exceeded 1 billion yuan.



The general machine learning process starts from a data set, which is split into a training set and a test set. The variables in the data set may be numeric, categorical, or text; after these variables are obtained, they are fused into a single feature representation.

The next step is to analyze these features and then build a model: select several candidate algorithms and see which one best achieves the desired effect. Each algorithm also has external input parameters specific to itself, so we tune those parameters based on our own experience and choose the model that performs best.

Existing solutions

Fees: solutions such as Nessus and their companion offerings charge. For example, a product deployed in the cloud may also be available for localized deployment inside the company, but for a fee.

Data security: Cloud deployment requires uploading data to the cloud, which many Internet companies, especially those focused on security and quality, are reluctant to do. Even after encrypting some of the key fields, security concerns remain.

Data visualization: Many open source tools do not provide data visualization, so you may need to use other open source tools for visualization.

Distributed computing: Some algorithms have no distributed implementation and can only run on a single machine, so a lot then depends on how much memory the server has.

Model deployment: Once a model has been trained, how is it deployed to production? For those unfamiliar with the process, deployment is an issue that deserves real attention.



Our Dianrong machine learning platform is based on a Spark cluster, with secondary development on top of the open-source framework to add the features we felt were important.

HDFS data can be read

Since the Dianrong machine learning platform is based on Spark clusters, it can read HDFS data, which is the most basic requirement.

Data visualization

We support data visualization: when a data set is loaded, click on the feature you want and the platform describes its distribution over the entire data set.

Rank of feature importance

When a data set is read in, a single button ranks the features by importance, saving time during analysis.

Collinearity analysis

The results of many algorithms are greatly compromised when variables are strongly correlated, so we added a feature for examining correlations between variables.

Model library

The model library already includes most commonly used Spark algorithms, as well as some deep learning algorithms.

Deployment model

We have a one-click Publish button: once a model is generated, it can be published directly, producing an interface that can later be invoked in deployment.



Application of Graph Mining in the field of risk control



The relational data is connected into one large graph through points and edges. From historical data we know which points, that is, which people, are bad or good, so they can be labeled.

On this graph, you can apply the relevant algorithms to learn and obtain results.



If you pull out all the points in each user's third-degree network, then every user's third-degree network is a very small graph. Historical data then tells you which of these small graphs are good and which are bad, and machine learning can work entirely from the structure of these small graphs.



Smoothness hypothesis: If two points are close in feature space, there is a high probability that they belong to the same category. This assumption is a prerequisite for all regression learning.

Cluster hypothesis: Clustering yields different subclasses, and points within the same subclass are very likely to belong to the same category.

Manifold hypothesis: If points in the feature space lie on different manifold structures, they are unlikely to belong to the same category.



In the graph structure, some points are more closely connected to certain people than to others; such groups can be found with community-detection algorithms.

How to improve model performance

Based on the data

If the model doesn’t work well, perhaps the first thing to consider is whether the features are good enough, and to try to find more features; also whether the data processing is adequate and whether the characteristics of the data itself have been analyzed in depth. You may also have made mistakes in how you used the data, causing the model to underperform.

With the help of algorithms

If the current algorithm is relatively simple or linear, its performance can serve as a baseline, and more complex algorithms can then be used to fit the data set.

Parameters of the algorithm

Complex nonlinear algorithms have some hyperparameters, and the more complex the algorithm, the more hyperparameters it has.

Model fusion

Model fusion is an approach worth trying when the results are still not good enough.

That’s all for today’s sharing, thank you!