Tag: Model deployment PMML

For algorithm engineers, successfully deploying a trained model online is an important part of the job. Deploying a Python-trained model in a Python environment is the natural choice, but most of my actual projects require deploying the model across platforms (for example, into a Java environment). Predictive Model Markup Language (PMML) is one solution for cross-platform deployment.

1. What is PMML? 🤔

PMML is a platform- and environment-neutral model representation language based on the XML standard. It uses an XML Schema to define and store the core elements of an algorithm model:

  • Data dictionary: Describes input data
  • Data conversion: Defines how raw data is processed, such as standardization, missing value handling, dummy variable generation, etc.
  • Model definition: The type and parameters of the model, such as the split nodes of the tree model
  • Model output: The output of the model

It is easy to see that, because PMML defines all of these core elements, a single PMML file can describe the entire data mining process. In other words, at deployment time back-end developers only need to read the data, call the PMML file, and collect the output. They do not need to deal with data transformations, model parameters, and similar details, which makes model deployment more efficient.
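
To make the four elements a little more concrete, here is a minimal sketch that reads a hand-written, heavily simplified PMML-like fragment with Python's standard library and lists its top-level sections and input fields. The fragment is an assumption made for illustration: it is not the output of sklearn2pmml and is not meant to be schema-valid, and `local_name` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified fragment for illustration only.
# A real exported PMML file contains many more elements and attributes.
PMML_TEXT = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="sepal length (cm)" optype="continuous" dataType="double"/>
    <DataField name="petal length (cm)" optype="continuous" dataType="double"/>
    <DataField name="y" optype="categorical" dataType="integer"/>
  </DataDictionary>
  <TransformationDictionary/>
  <MiningModel functionName="classification"/>
</PMML>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(PMML_TEXT.strip())

# Top-level sections: data dictionary, transformations, model definition.
sections = [local_name(child.tag) for child in root]

# Input and target fields declared in the data dictionary, with their data types.
fields = {
    element.get("name"): element.get("dataType")
    for element in root.iter()
    if local_name(element.tag) == "DataField"
}

print(sections)
print(fields)
```

The same approach can be used to take a quick look at a generated file, for example to check which data type each input field was given.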

2. How does Python implement PMML file packaging? 💻

This article uses the Iris data set as an example: it trains an XGBoost classifier through a Pipeline and then exports it as a PMML file. A Pipeline chains several data preprocessing steps and one or more models together in sequence, like sections of a pipe, so that all phases of data mining run as one unit, which simplifies model training and deployment.

The Iris data set contains 4 feature variables (sepal length, sepal width, petal length, and petal width) and 1 class label (0-setosa, 1-versicolor, 2-virginica). The objective of our model is to distinguish the three types of irises based on the feature variables.

Step 0: Import the required packages
# import packages
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from xgboost import XGBClassifier
Step 1: Train a model in Python
# read data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# split train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# define datamapper
mapper = DataFrameMapper([
  (["sepal length (cm)", "sepal width (cm)"], StandardScaler()),
  (["petal length (cm)"], None)
], df_out=True)


# define a model
xgb = XGBClassifier(n_estimators=5, seed=123)

# make a pipeline
pip_model = PMMLPipeline([
  ('mapper', mapper),
  ("classifier", xgb)]
)

# train a model & predict on test data
pip_model.fit(X_train, y_train)
pred_prob_pip = pip_model.predict_proba(X_test)

In this example, the DataFrameMapper specifies that only three feature variables are used as input (petal width is dropped) and that only sepal length and sepal width are standardized, while petal length is passed through unchanged. Next, an XGBClassifier is defined. Finally, PMMLPipeline combines the data processing step and the model so that they are trained together.
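
To show what these column rules amount to, here is a plain-Python sketch of the same three decisions: standardize two columns, pass one through, drop one. It is not sklearn_pandas code. `standardize` and `apply_mapper` are hypothetical helpers, and the rows are made-up values rather than real Iris measurements. The sketch divides by the population standard deviation, which is what StandardScaler does for non-constant columns.

```python
from statistics import mean, pstdev

# Made-up rows for illustration only (not taken from the Iris data set).
rows = {
    "sepal length (cm)": [5.0, 6.0, 7.0],
    "sepal width (cm)": [3.0, 3.5, 4.0],
    "petal length (cm)": [1.4, 4.5, 6.0],
    "petal width (cm)": [0.2, 1.5, 2.5],
}


def standardize(values):
    """Scale a column to zero mean and unit variance (population standard deviation).

    Assumes the column is not constant, so the standard deviation is non-zero.
    """
    center = mean(values)
    spread = pstdev(values)
    return [(value - center) / spread for value in values]


def apply_mapper(columns):
    """Hypothetical helper mirroring the column rules described above."""
    return {
        # Standardized, each column independently.
        "sepal length (cm)": standardize(columns["sepal length (cm)"]),
        "sepal width (cm)": standardize(columns["sepal width (cm)"]),
        # Passed through unchanged.
        "petal length (cm)": list(columns["petal length (cm)"]),
        # "petal width (cm)" is not listed, so it is dropped.
    }


mapped = apply_mapper(rows)
print(list(mapped))
```

Because the same rules are written into the PMML file, whoever deploys the model does not have to reimplement them.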

Step 2: Save as a PMML file
# save model 
sklearn2pmml(pip_model, "iris_model.pmml", with_repr = True)

The output PMML file is plain XML and can be opened in any text editor.

Step 3: Call PMML in Python to verify the result
from pypmml import Model
model = Model.fromFile('iris_model.pmml')
pred_prob_reloaded_model = model.predict(X_test)

When the PMML file is loaded back in Python, its predictions should be consistent with those of the original Pipeline.
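
One way to make "consistent" checkable is to compare the two sets of class probabilities element by element. The sketch below assumes both results have already been converted to plain lists of rows. `predictions_match` is a hypothetical helper, the 5-decimal tolerance follows the rule of thumb mentioned later in this article, and the numbers are placeholders rather than real model output.

```python
import math


def predictions_match(original, reloaded, decimals=5):
    """Return True if every pair of probabilities differs by at most 10**-decimals."""
    tolerance = 10 ** (-decimals)
    if len(original) != len(reloaded):
        return False
    for row_a, row_b in zip(original, reloaded):
        if len(row_a) != len(row_b):
            return False
        for a, b in zip(row_a, row_b):
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance):
                return False
    return True


# Placeholder probabilities, not real model output.
pipeline_probs = [[0.90, 0.06, 0.04], [0.10, 0.80, 0.10]]
pmml_probs = [[0.9000001, 0.0599999, 0.04], [0.10, 0.80, 0.10]]

print(predictions_match(pipeline_probs, pmml_probs))
```

In practice the two inputs would come from `pred_prob_pip` and from the probability columns of the reloaded model's output. How to extract those columns depends on the output layout of the PMML model, which is not shown here.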

3. What pitfalls are possible? 😂

The example above is simple to implement, but actual deployments always run into some pitfalls. After all, everyone learns by falling into them. Here are some of the issues I encountered when deploying PMML, for your reference.

  • Pit 1: The model can be trained, but the PMML file cannot be generated

    This may be caused by the environment: the versions of sklearn, sklearn2pmml, and sklearn_pandas need to match one another. In addition, when the sklearn2pmml version is too high and the jar version is too low, the file cannot be read. The versions used in this article are as follows:

    • sklearn2pmml = 0.53.0
    • sklearn = 0.23.1
    • sklearn_pandas = 2.2.0
  • Pit 2: Predictions from calling the PMML file in Java are inconsistent with those of the original Pipeline model (normally, they should agree to 5 decimal places)

    There are two possible causes of this problem: (1) inconsistent data types; (2) the handling of missing values.

    • If the data types are inconsistent, you can try either of the following two solutions:

      • When the PMML file is generated, all variables are double by default, while some categorical variables are strings, so consider encoding the categorical variables as numeric values with LabelEncoder
      • Change double to string for those variables in the PMML file
    • Missing value handling: When missing values in the data are not NaN but are instead replaced by -9999, the PMML file needs -9999 defined as the missing value in both the DataFrameMapper and the model; otherwise, prediction inconsistencies will occur. In that case, the definitions should look like this:

      # ContinuousDomain lets the PMML file treat -9999 as a missing value
      from sklearn2pmml.decoration import ContinuousDomain

      # define datamapper
      mapper = DataFrameMapper([
        (["sepal length (cm)", "sepal width (cm)"], StandardScaler()),
        (["petal length (cm)"], ContinuousDomain(missing_values=-9999.0, with_data=False))
      ], df_out=True)


      # define a model that also treats -9999 as missing
      xgb = XGBClassifier(n_estimators=5, seed=123, missing=-9999.0)

      Note: By default, ContinuousDomain learns the value range of each feature from the training data, and an error is reported when the test data contains values outside this range. Setting with_data=False removes this restriction.

  • Pit 3: Predictions from the original Pipeline model are inconsistent with those of the PMML file loaded in Python

    This situation is also largely related to missing values. If the problem persists after pit 2 has been handled properly, it may be because -9999 was also passed in for the missing features when the PMML model was called from Python. The correct approach when calling the PMML file with pypmml is either to omit the missing features or to set them to NaN. The pypmml predictions will then be consistent with those of the original Pipeline model.
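
The fix for pit 3 can be sketched as a small preprocessing step applied to each input row before it is handed to pypmml. `to_pmml_record` is a hypothetical helper and the row holds made-up values. The sketch only prepares the input and does not call pypmml itself.

```python
import math

SENTINEL = -9999.0  # placeholder used for missing values in the training data


def to_pmml_record(record, sentinel=SENTINEL, drop_missing=True):
    """Hypothetical helper: remove sentinel-coded missing values from one input row.

    With drop_missing=True the feature is omitted entirely;
    with drop_missing=False it is replaced by NaN instead.
    """
    cleaned = {}
    for name, value in record.items():
        if value == sentinel:
            if not drop_missing:
                cleaned[name] = math.nan
            continue
        cleaned[name] = value
    return cleaned


# Made-up row in which petal length is missing and coded as -9999.
row = {
    "sepal length (cm)": 5.1,
    "sepal width (cm)": 3.5,
    "petal length (cm)": -9999.0,
}

dropped = to_pmml_record(row)
as_nan = to_pmml_record(row, drop_missing=False)
print(dropped)
print(as_nan)
```

The cleaned dictionary could then be passed to the loaded model's `predict` method, assuming the installed pypmml version accepts dictionary input. That is worth verifying before relying on it.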

Stay tuned for:

  • Building Pipeline models in Python
  • Calling Python-trained PMML models from Scala

Unauthorized reprinting is prohibited. To reprint, please contact the author; otherwise, the author reserves all legal rights.