An algorithm novice's introduction to machine learning in practice: from zero to production

This walkthrough uses PAI, the machine learning platform of Aliyun (Alibaba Cloud).

Project background

The original business is a fairly common decision problem: the input is measurement data about an entity (with some error) plus related reference information, and the output is the entity it should be attributed to. A simple analogy: given an unsigned article, decide from its writing style whether it belongs to an existing author in the library or to a new anonymous author.

The difficulties are:

The number of entities to classify into is large, on the order of millions
The number of classes is not fixed and changes dynamically: new classes appear and old ones expire
The measurements contain error
There are many scenarios to cover

Evaluation criteria:

Accurate attribution (an article filed under an author's name must really be theirs)
Few misses (find the author of each article whenever possible)
No spurious creation (do not create multiple entities for the same author)

The original solution was a traditional one: a set of strategies was formulated and combined into a decision-tree structure. This requires a lot of prior knowledge and the logic is heavy. For example, a rule might judge whether two articles share an author based on commonly used personal pronouns and grammatical structures.

Typical decision tree

As accuracy requirements kept rising, the strategy combination became increasingly hard to tune and maintain, mainly for the following reasons:

The growing number of strategies makes maintenance harder
Many patches were added to tolerate measurement error, which makes maintenance harder still
Thresholds and strategies are tuned by experience, and every adjustment needs regression verification against a lot of data
There are many scenarios, and formulating a dedicated optimization strategy for each one is a lot of work

In short, the strategy set would keep growing for the foreseeable future, and maintenance would keep getting harder. Having recognized this technical risk, the team concluded that a machine learning approach could, on the one hand, reduce the complexity of the solution (lower maintenance cost) and, on the other hand, make later scenario expansion and capability migration easier. At the very least, a model should handle the common scenarios, with customized optimization strategies layered on top (lower development cost).

I was lucky enough to be picked for the investigation. I had no prior experience solving problems with models (a complete novice), but I could not pass up the chance to broaden my technical horizons. So here is a record of the pitfalls along the way, in the hope of giving other interested non-algorithm engineers some confidence that (shallow) modeling is not as hard as commonly believed.

Preparation

Problem definition

The essence of the problem is matching: the goal is to compare the distance between two vectors. There are several ways to approach this, for example:

Classification: train on labeled data from each class, then predict which of the existing classes new data belongs to
Clustering: group similar data together, producing multiple clusters
Regression: fit a function from labeled data and their known target values, then use it to predict the value for new data

Introduction to regression problem

Introduction to classification Problems

Model selection

Clustering model

Idea: feature vectors that fall into the same cluster are close to each other, so they belong to the same class.

A clustering model does not lend itself to online prediction (every prediction would require re-clustering, which is expensive), and common clustering models have their own limitations, such as:

K-means requires the number of clusters K to be set in advance, which does not suit this business scenario
DBSCAN requires some prior knowledge, such as a distance threshold between vectors, which is hard to determine in this scenario
Clustering models struggle to support millions of clusters

Regression model

I didn’t really know how to convert the vector matching problem into a regression problem, so I didn’t try very hard.

Multi-class classification model

All data belonging to the same entity forms one class, and the class is predicted from all the features.

Multi-class classification is often implemented by training one classifier per class, and training millions of classifiers is clearly unreasonable, so this approach was abandoned.

Binary classification model

Studying a Tianchi competition showed that a reasonable approach is to compute the “difference” between two feature vectors directly and label each difference by whether the pair ultimately matches, so the binary classification model was the final choice.

Here’s an example:

x_diff	y_diff	match
0.1	0.1	true
0.9	0.9	false

In short, the original decision engine used multiple isMatch(v1, v2, threshold) rules, and a pair that satisfied a specific combination of rules was considered to be the same class. With the machine learning model, the vector [a1-a2, b1-b2, ... , x1-x2] is mapped into a higher-dimensional space, and the model draws the line that separates the classes.
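As a rough illustration of the diff vector described above, here is a minimal Java sketch. The class and method names are hypothetical, and taking the absolute value of each difference is an assumption made so that the result does not depend on argument order; the original only specifies a1-a2.

```java
import java.util.Arrays;

/** Hypothetical helper: builds the per-dimension diff vector for an (input, candidate) pair. */
final class DiffFeatures {

    /** Element-wise absolute difference; the absolute value is an assumption, not from the article. */
    public static double[] diff(double[] input, double[] candidate) {
        if (input.length != candidate.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double[] out = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = Math.abs(input[i] - candidate[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] article = {0.5, 0.25};
        double[] author = {0.25, 0.75};
        System.out.println(Arrays.toString(diff(article, author)));
    }
}
```

Each such vector, together with a match / no-match label, becomes one training row for the binary classifier.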

The nature of categorical decision making

Feature engineering

Basic data

Choose basic attributes that are relevant to the result wherever possible, and avoid computing features from obviously unrelated attributes for training.

Feature computation

As mentioned above, each final feature is the diff between the input and the candidate along the same attribute dimension, so the resulting vector consists of the diffs over N dimensions.

Special handling

First, the machine learning model knows neither the relationships between feature dimensions nor how to compute the diff for each dimension, so some processing is needed.
Second, some special processing can strengthen the correlation between the basic features and the result. For example, x and y coordinates can be combined into a two-dimensional point, so that the diff changes from separate x_diff and y_diff values to the Euclidean distance between the two (x, y) points.

Some non-numerical features can be mapped to (continuous) numerical features by special means, which clearly helps the training result:

Text: try the Levenshtein distance (edit distance)
Enumerations: try the distance between one-hot encodings
Distances: cosine similarity, Manhattan distance, etc.

This is business specific, and not every feature dimension needs this kind of treatment.
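Two of the transformations mentioned above can be sketched in a few lines of Java. This is only an illustration under assumed names (FeatureDistances, euclidean, levenshtein); the article does not show how its own features were computed.

```java
/** Hypothetical helpers that turn raw attribute pairs into numeric diff features. */
final class FeatureDistances {

    /** Euclidean distance between the points (x1, y1) and (x2, y2). */
    public static double euclidean(double x1, double y1, double x2, double y2) {
        return Math.hypot(x1 - x2, y1 - y2);
    }

    /** Levenshtein (edit) distance, computed with two rolling rows of the DP table. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(euclidean(0, 0, 3, 4));
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```

In practice the edit distance is often divided by the longer string's length so that the feature stays in a comparable range; whether that helps depends on the business data.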

Data preparation

Data volume

More than 100,000 feature vectors computed by feature engineering.

Labeling

A binary classification model needs the data divided into two classes, such as class 0 and class 1, or class A and class B. The class label must be part of the feature data during training; in other words, the training data is really feature vector + class label.

In this business, the class is whether the two sides match. In the article scenario, each article comes with multiple candidate authors: the feature vector formed by diffing against the real author's attributes is labeled A, while the feature vectors formed against the other, non-real authors are labeled B.

Since the accuracy of the original decision model is about 9x%, its results were used directly as the class labels. Manually labeled results would of course be better, although manual labeling is not necessarily 100% accurate either.

Preprocessing

Normalization
- To keep any single dimension from having too much influence, map the values of each dimension to the range 0 to 1
- Calculation: normalized_value = (value - min_value) / (max_value - min_value)
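The formula above translates directly into code. The sketch below is a hypothetical helper, not the article's BizFeatureNormalizationHelper; clamping out-of-range values and returning 0 for a constant column are assumptions added for robustness.

```java
/** Hypothetical min-max normalizer for a single feature dimension. */
final class MinMaxNormalizer {

    /** Maps value into [0, 1] using the min and max observed for this dimension. */
    public static double normalize(double value, double min, double max) {
        if (max == min) {
            return 0.0; // assumption: a constant column carries no information
        }
        double scaled = (value - min) / (max - min);
        // assumption: clamp values that fall outside the range seen in the statistics
        return Math.max(0.0, Math.min(1.0, scaled));
    }

    public static void main(String[] args) {
        System.out.println(normalize(5.0, 0.0, 10.0));
    }
}
```

The min and max must come from the same statistics at training time and at prediction time, otherwise the same raw value maps to different model inputs.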

Modeling process

Modeling

A simple XGBoost binary classification pipeline

Notes

This pipeline was built on the Aliyun PAI platform, which saves the pain of setting up an environment. The process itself, of course, does not depend on the platform.

Read the data table

The labeled feature data is the input.

Preprocessing script

Features stored as sparse structures are preprocessed and converted into a standard ordered format. That is, the Map structure from Java is converted into a one-dimensional array in an agreed order. Because there are many features, this step uses a script produced by code generation.
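The Map-to-array conversion described above might look like the following Java sketch. The names (SparseToDense, toDense, featureOrder) are hypothetical, and filling absent keys with a default value is an assumption; the article's actual script was generated and is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical converter from a sparse feature map to a dense array in an agreed order. */
final class SparseToDense {

    /** Emits one value per name in featureOrder; missing or null entries get defaultValue. */
    public static double[] toDense(Map<String, Double> sparse, List<String> featureOrder, double defaultValue) {
        double[] dense = new double[featureOrder.size()];
        for (int i = 0; i < dense.length; i++) {
            Double value = sparse.get(featureOrder.get(i));
            dense[i] = value == null ? defaultValue : value;
        }
        return dense;
    }

    public static void main(String[] args) {
        Map<String, Double> sparse = new HashMap<>();
        sparse.put("y_diff", 0.9);
        List<String> order = Arrays.asList("x_diff", "y_diff");
        System.out.println(Arrays.toString(toDense(sparse, order, 0.0)));
    }
}
```

The important part is that the order is fixed once and shared by training and serving, since the model only sees positions, not names.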

Type conversion

Some feature values need to be converted from non-numeric to numeric types, and some missing values can be filled in.

Full-table statistics

Computes statistics for the table, including maximum and minimum values, variance and so on, for reference during data preprocessing; for example, the maximum and minimum of each dimension are used in the normalization calculation.

This step is not required for training or prediction; its output is mainly for use off the platform.

Data normalization

Map the values of each dimension to 0 ~ 1.

Data split

The full input feature data is randomly divided into two parts:

Training data
- In theory, the more training data there is, the more stable and accurate the trained model's predictions
Prediction (test) data
- Used to test the trained model; the model's predictions on this data are the basis for model evaluation

The training and prediction data can be split 7:3 or 8:2.
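Outside the platform, a random split like the one above can be sketched as follows. The class name and the fixed seed parameter are assumptions for illustration; PAI's own split component is not shown in the article.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical random train/test splitter. */
final class TrainTestSplit {

    /** Shuffles a copy of rows and returns a two-element list: [training rows, prediction rows]. */
    public static <T> List<List<T>> split(List<T> rows, double trainRatio, long seed) {
        if (trainRatio <= 0.0 || trainRatio >= 1.0) {
            throw new IllegalArgumentException("trainRatio must be strictly between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible
        int cut = (int) Math.round(shuffled.size() * trainRatio);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.7, 42L);
        System.out.println(parts.get(0).size() + " training rows, " + parts.get(1).size() + " prediction rows");
    }
}
```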

XGBoost binary classification

XGBoost is a model based on the gradient boosting decision tree (GBDT) algorithm, proposed by Tianqi Chen's team.

The idea behind GBDT can be explained with a popular example. Suppose a person is 30 years old. We first fit with 20 and find the remaining gap is 10. We then fit that gap with 6 and find the gap is still 4. If the iteration budget is not used up, we keep going, and with each round the error of the fitted age shrinks.

— References
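The age example can be mimicked with a toy loop. This is a deliberately simplified sketch of residual fitting on a single sample, not XGBoost and not a real tree learner: each round's "weak learner" simply covers a fixed fraction (the hypothetical shrinkage parameter) of whatever gap remains.

```java
import java.util.Arrays;

/** Toy illustration of boosting on one sample: every round fits part of the remaining residual. */
final class ResidualFitting {

    /** Returns the gap between target and prediction after each round. */
    public static double[] fit(double target, double shrinkage, int rounds) {
        double prediction = 0.0;
        double[] gaps = new double[rounds];
        for (int m = 0; m < rounds; m++) {
            double residual = target - prediction; // what the model still gets wrong
            prediction += shrinkage * residual;    // the new learner covers part of that residual
            gaps[m] = target - prediction;
        }
        return gaps;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fit(30.0, 0.5, 3)));
    }
}
```

With a target of 30 and a shrinkage of 0.5, the remaining gap halves every round, which is the same "keep fitting what is left" behaviour as in the age example.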

Model prediction

The prediction data is fed into the trained model, and the results are used to evaluate the model.

Model export

Export the trained model to a PMML file.

Confusion matrix / binary classification evaluation

Used to evaluate how well the model performs.

Model evaluation

Confusion matrix

Results

Notes

	Predicted positive	Predicted negative
Actual positive	True positive (TP): actual P predicted as P, correct	False negative (FN): actual P predicted as N, wrong
Actual negative	False positive (FP): actual N predicted as P, wrong	True negative (TN): actual N predicted as N, correct

Accuracy

ACC = (TP + TN) / (TP + FN + FP + TN)

The overall accuracy: the number of correct predictions divided by the total number of predictions.

Precision

PPV = TP / (TP + FP)

That is, the proportion of samples predicted as P that really are P. The same can be computed for N.

Recall

TPR = TP / (TP + FN)

That is, the proportion of actual P samples that are correctly predicted as P. The same can be computed for N.

In general, accuracy reflects the overall classification ability, while precision and recall reflect the predictive ability for one particular class.
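The three formulas above can be checked with a few lines of Java. The class name and the sample counts in main are made up for illustration only.

```java
/** Hypothetical helper that computes ACC, PPV and TPR from confusion-matrix counts. */
final class BinaryMetrics {

    /** ACC = (TP + TN) / (TP + FN + FP + TN) */
    public static double accuracy(long tp, long fn, long fp, long tn) {
        return (double) (tp + tn) / (tp + fn + fp + tn);
    }

    /** PPV = TP / (TP + FP); NaN when nothing was predicted positive. */
    public static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    /** TPR = TP / (TP + FN); NaN when there are no actual positives. */
    public static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    public static void main(String[] args) {
        long tp = 40, fn = 10, fp = 20, tn = 30; // made-up counts
        System.out.println("accuracy  = " + accuracy(tp, fn, fp, tn));
        System.out.println("precision = " + precision(tp, fp));
        System.out.println("recall    = " + recall(tp, fn));
    }
}
```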

Binary classification evaluation

The results of binary classification evaluation are usually presented as ROC and K-S curves.

Example ROC curve

Evaluation indicators

Put simply, the higher the ROC curve and the higher the AUC, F1 score and KS values, the better the model.
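For readers who want to see what AUC and KS measure, here is a small brute-force Java sketch. It assumes both classes are present and that a higher score means "more likely positive"; the names are hypothetical and the quadratic loops are only suitable for small samples, not for what the platform does internally.

```java
/** Hypothetical brute-force AUC and KS calculations for small score arrays. */
final class RankingMetrics {

    /** AUC as the share of (positive, negative) pairs where the positive scores higher; ties count half. */
    public static double auc(double[] scores, boolean[] positive) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!positive[i]) {
                continue;
            }
            for (int j = 0; j < scores.length; j++) {
                if (positive[j]) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        return wins / pairs;
    }

    /** KS as the largest gap between TPR and FPR over all candidate thresholds. */
    public static double ks(double[] scores, boolean[] positive) {
        int totalPos = 0;
        for (boolean p : positive) {
            if (p) {
                totalPos++;
            }
        }
        int totalNeg = positive.length - totalPos;
        double best = 0.0;
        for (double threshold : scores) {
            int tp = 0;
            int fp = 0;
            for (int i = 0; i < scores.length; i++) {
                if (scores[i] >= threshold) {
                    if (positive[i]) {
                        tp++;
                    } else {
                        fp++;
                    }
                }
            }
            double gap = Math.abs((double) tp / totalPos - (double) fp / totalNeg);
            best = Math.max(best, gap);
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.4, 0.3};
        boolean[] positive = {true, false, true, false};
        System.out.println("AUC = " + auc(scores, positive));
        System.out.println("KS  = " + ks(scores, positive));
    }
}
```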

Feature importance

The XGBoost model evaluates the importance of each feature, and features that contribute very little are discarded during training.

The trained model can also output a ranking of feature importance, which can guide optimization. For example, if an important feature dimension is computed by some service, improving that service's capability will affect the results more than improving other capabilities.

Service integration

Model export

The trained model can be exported to a standard PMML file.

PMML (Predictive Model Markup Language) is a general specification format for data mining. A PMML file is really a long XML file that contains the features used and the relationships between them. The file is loaded to reconstruct the trained model and run predictions; in other words, the PMML file is like the code of a “class” from which a running “instance” is created.

Integration to services

PMML files can be loaded simply with the pmml-evaluator package:

<!-- Dependency -->
<dependency>
    <groupId>org.jpmml</groupId>
    <artifactId>pmml-evaluator</artifactId>
    <version>1.5.9</version>
</dependency>

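import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.List;
import java.util.Map;

import javax.annotation.PostConstruct;
import javax.xml.bind.JAXBException;

import com.google.common.collect.Maps;
import org.dmg.pmml.FieldName;
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.ModelEvaluatorBuilder;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.evaluator.ProbabilityDistribution;
import org.jpmml.evaluator.TargetField;
import org.jpmml.evaluator.ValueFactoryFactory;
import org.jpmml.model.PMMLUtil;
import org.springframework.stereotype.Service;
// Assumption: Spring's CollectionUtils; the original does not say which library it comes from
import org.springframework.util.CollectionUtils;
import org.xml.sax.SAXException;

/** Integration example. BizFeature and BizFeatureNormalizationHelper are business classes that are not shown here. */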
@Service
public class PmmlDemo {

    /**
     * Path to the model PMML file
     */
    private static final String MODEL_PMML_PATH = "/model/gbdt_model_20210106.pmml";

    /** Model */
    private Evaluator model;

    /**
     * Parameter list
     */
    private List<InputField> paramFields;

    /** Positive target category; must match the positive class used when labeling the training data */
    private static final Object TARGET_CATEGORY = "0";

    @PostConstruct
    public void init() throws IOException, JAXBException, SAXException {
        model = buildEvaluator();
        paramFields = model.getInputFields();
    }

    /** Note: pmml-evaluator 1.5.x is used slightly differently from 1.4 */
    private static Evaluator buildEvaluator() throws JAXBException, SAXException, IOException {
        InputStream inputStream = PmmlDemo.class.getResourceAsStream(MODEL_PMML_PATH);
        PMML pmml = PMMLUtil.unmarshal(inputStream);
        inputStream.close();

        ModelEvaluatorBuilder evaluatorBuilder = new ModelEvaluatorBuilder(pmml, (String)null)
            .setModelEvaluatorFactory(ModelEvaluatorFactory.newInstance())
            .setValueFactoryFactory(ValueFactoryFactory.newInstance());

        return evaluatorBuilder.build();
    }

    /** Model prediction */
    public Double getPredictScore(BizFeature feature) throws InvocationTargetException, IllegalAccessException {
        if (feature == null) {
            throw new NullPointerException();
        }
        // Read feature data
        Map<String, Object> fieldMap = featureToMap(feature);
        // Populate the model input
        Map<FieldName, FieldValue> params = fillParams(fieldMap);
        // Predict
        ProbabilityDistribution result = predict(params);
        if (result == null) {
            return null;
        }
        return result.getProbability(TARGET_CATEGORY);
    }

    /** Converts the attributes of the business feature object into a map via reflection, including data preprocessing */
    private static Map<String, Object> featureToMap(BizFeature feature) throws InvocationTargetException, IllegalAccessException {
        Map<String, Object> output = Maps.newHashMapWithExpectedSize(512);

        Method[] methods = BizFeature.class.getDeclaredMethods();
        for (Method method : methods) {
            String key = method.getName();
            if (!key.startsWith("get")) {
                continue;
            }
            key = key.toLowerCase();
            if (key.contains("bizid") || key.contains("entityid") || key.contains("label")) {
                continue;
            }

            Object value = method.invoke(feature);
            key = key.substring(3);
            put(output, key, value);
        }

        return output;
    }

    /** Data preprocessing */
    private static void put(Map<String, Object> outputMap, String key, Object value) {
        if (value instanceof Integer) {
            outputMap.put(key, BizFeatureNormalizationHelper.normalization(key, (Integer)value));
        } else if (value instanceof Double) {
            outputMap.put(key, BizFeatureNormalizationHelper.normalization(key, (Double)value));
        }
    }

    /** Extracts the business feature values that the model requires and fills them in as model inputs */
    private Map<FieldName, FieldValue> fillParams(Map<String, Object> map) {
        Map<FieldName, FieldValue> params = Maps.newHashMap();
        for (InputField inputField : paramFields) {
            FieldName inputFieldName = inputField.getName();
            Object rawValue = map.get(inputFieldName.getValue());
            FieldValue inputFieldValue = inputField.prepare(rawValue);
            params.put(inputFieldName, inputFieldValue);
        }

        return params;
    }

    /** Prediction. It is unclear whether evaluation is thread-safe, so synchronized is used as a precaution */
    private synchronized ProbabilityDistribution predict(Map<FieldName, FieldValue> arguments) {
        Map<FieldName, ?> results = model.evaluate(arguments);
        List<TargetField> targetFields = model.getTargetFields();
        if (CollectionUtils.isEmpty(targetFields)) {
            return null;
        }

        TargetField targetField = targetFields.get(0);
        FieldName targetFieldName = targetField.getName();
        return (ProbabilityDistribution)results.get(targetFieldName);
    }
}

Note: the way pmml-evaluator loads a model in the 1.5.x versions differs from versions before 1.5.

Online prediction

The prediction outputs a score between 0 and 1 for each class. In binary classification the two scores are complementary: for example, if the score for A is 0.3, the score for B is 0.7.

The classification evaluation step helps determine a reasonable prediction threshold, for example whether to cut at 0.4 or at 0.6. A score below the threshold is classified as A; otherwise it is classified as B.
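The threshold rule above amounts to a one-line comparison. In the sketch below the class name is hypothetical, and which class's score is compared against the threshold is left to the caller, since that depends on how the positive class was defined during labeling.

```java
/** Hypothetical helper that applies a prediction threshold to a model score. */
final class ThresholdDecision {

    /** Returns "A" when the score is below the threshold and "B" otherwise. */
    public static String classify(double score, double threshold) {
        return score < threshold ? "A" : "B";
    }

    public static void main(String[] args) {
        System.out.println(classify(0.3, 0.4));
        System.out.println(classify(0.7, 0.4));
    }
}
```

Moving the threshold trades precision against recall, so it is usually picked from the evaluation curves rather than fixed at 0.5.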

Online predictions may differ slightly from the predictions for the same data during model training, because the statistics used in data preprocessing are not exactly the same. In this business, the normalization statistics used in training differ slightly from those used in online prediction.

Closing notes

Having waded across the river, it turns out that solving a problem with a machine learning model is not as hard as you might think, as long as you do not dig too deep. Hopefully this short note makes it easier for more people to change their way of thinking and add a new tool to their toolkit.

In addition, Java and Python also have tools that can train models directly and export them to PMML files; the platform is simply more convenient.

Pitfalls and lessons learned

Be sure to define the problem clearly. This is probably the hardest part of machine learning. (Tears)
The two classes in binary classification should have similar sample counts; otherwise the model may be biased toward one class
Integrating via PMML requires the same data preprocessing that was used when training the model

References

Machine learning: regression problems (Zhihu)

Machine learning series (7): classification problems (CSDN)

Classification evaluation metrics: confusion matrix, ROC, AUC, KS, AR, PSI, Lift, Gain (snowdroptulip, CSDN blog)

Calling PMML from Java with version 1.5.1 (bob71's blog, CSDN blog)

Summary of the principles of gradient boosting trees (GBDT) (Pinard Liu Jianping, Blog Park)

This article was migrated from my blog; you are welcome to visit!
