1. Background

XGBoost is widely used in data science competitions and in industry as a powerful, go-to machine learning tool. XGBoost also ships code that runs on a variety of platforms and environments, such as XGBoost on Spark for distributed training. However, the official implementation of XGBoost on Spark has an instability issue caused by the interaction between XGBoost's missing-value handling and Spark's sparse representation mechanism.

The incident originated from feedback from a user of a machine learning platform at Meituan. An XGBoost model trained on the platform, given the same model file and the same test data, produced different results when invoked locally (Java engine) and on the platform (Spark engine). Yet when the user ran two engines locally (Python engine and Java engine) for testing, the execution results were consistent. Does that mean there is a problem with the platform's XGBoost predictions?

The platform had optimized the XGBoost model several times, and during those rounds of XGBoost model testing, no inconsistency between local invocation (Java engine) and platform (Spark engine) results had ever appeared. In addition, both the version running on the platform and the version the user ran locally come from the official DMLC release, so the underlying code invoked through JNI should be exactly the same. In theory the results should be identical, but in practice they were different.

Nothing looked wrong in the test code provided by the user:

// One row, 41 columns in the test result
double[] input = new double[] {1, 2, 5, 0, 0, 6.666666666666667, 31.14, 29.28, 0, 1.303333, 2.8555, 2.37, 701, 463, 3.989, 3.85, 14400.5, 15.79, 11.45, 0.915, 7.05, 5.5, 0.023333, 0.0365, 0.0275, 0.123333, 0.4645, 0.12, 15.082, 14.48, 0, 31.8425, 29.1, 7.7325, 3, 5.88, 1.08, 0, 0, 0, 32};
// Convert to float[]
float[] testInput = new float[input.length];
for(int i = 0, total = input.length; i < total; i++){
  testInput[i] = new Double(input[i]).floatValue();
}
// Load the model
Booster booster = XGBoost.loadModel("${model}");
// convert to DMatrix, one row, 41 columns
DMatrix testMat = new DMatrix(testInput, 1, 41);
// Call the model
float[][] predicts = booster.predict(testMat);

The result of this code is 333.67892 locally and 328.1694030761719 on the platform.

How can it be different? What’s the problem?

2. Troubleshooting the inconsistent execution results

Where to start? The first direction we investigated was whether the field types of the inputs were consistent between the two code paths. If the fields differed in type, or if the decimal precision differed, that could explain the differing results. Looking carefully at the model input, we noticed a 6.666666666666667 in the array. Could that be the cause?

Debugging step by step, we compared the input data and field types on both sides carefully.

This ruled out inconsistencies in field type and precision between the two code paths.
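For reference, the double-to-float conversion is deterministic, so by itself it cannot produce a per-engine difference. A quick Scala sketch of what the precision loss looks like, using the value from the input array:

    // The suspicious value from the input array
    val d = 6.666666666666667
    // double -> float loses precision, but the rounding is deterministic,
    // so both engines see exactly the same float
    val f = d.toFloat
    println(f)   // approximately 6.6666665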

The second suspicion was about the wrappers: XGBoost on Spark provides the XGBoostClassifier and XGBoostRegressor APIs on top of JNI, which add many hyperparameters and encapsulate many upper-layer capabilities. Could some newly added hyperparameter in these two wrappers do something special to the input and produce inconsistent results?

After communicating with the user who reported the problem, we confirmed that the hyperparameters set in the Python code were exactly the same as those set on the platform. Examining the source code of XGBoostClassifier and XGBoostRegressor carefully, neither does anything special to the output.

This also ruled out the hyperparameter encapsulation in XGBoost on Spark.

Next, we checked the model input again for special values such as NaN, -1, 0, and so on. Sure enough, there were several zeros in the input array. Could this be a missing-value handling problem?

We quickly pulled up the source code of both engines and found that they do handle missing values differently!

Handling of missing values in XGBoost4j

XGBoost4j handles missing values while constructing the DMatrix, and 0.0f is used as the missing value by default:

  /**
   * create DMatrix from dense matrix
   *
   * @param data data values
   * @param nrow number of rows
   * @param ncol number of columns
   * @throws XGBoostError native error
   */
  public DMatrix(float[] data, int nrow, int ncol) throws XGBoostError {
    long[] out = new long[1];
    
    //0.0f as missing value
    XGBoostJNI.checkCall(XGBoostJNI.XGDMatrixCreateFromMat(data, nrow, ncol, 0.0f, out));
    
    handle = out[0];
  }

Missing value processing in XGBoost on Spark

XGBoost on Spark uses NaN as the default missing value:

/**
 * @return A tuple of the booster and the metrics used to build training summary
 */
@throws(classOf[XGBoostError])
def trainDistributed(
    trainingDataIn: RDD[XGBLabeledPoint],
    params: Map[String, Any],
    round: Int,
    nWorkers: Int,
    obj: ObjectiveTrait = null,
    eval: EvalTrait = null,
    useExternalMemory: Boolean = false,
    // NaN is the default missing value
    missing: Float = Float.NaN,
    hasGroup: Boolean = false): (Booster, Map[String, Array[Float]]) = {
  // ...
}

In other words, when the native Java API constructs a DMatrix and no missing value is specified, 0 is treated as missing by default, while in XGBoost on Spark the default missing value is NaN. The default missing values of the Java engine and the XGBoost on Spark engine are simply different. Neither the platform nor the user's code set the missing value explicitly, and this mismatch in missing values is exactly why the two engines produced inconsistent results!

We modified the test code to set the missing value to NaN on the Java engine side, and the result became 328.1694, exactly matching the platform's calculation.

    // One row, 41 columns in the test result
    double[] input = new double[] {1, 2, 5, 0, 0, 6.666666666666667, 31.14, 29.28, 0, 1.303333, 2.8555, 2.37, 701, 463, 3.989, 3.85, 14400.5, 15.79, 11.45, 0.915, 7.05, 5.5, 0.023333, 0.0365, 0.0275, 0.123333, 0.4645, 0.12, 15.082, 14.48, 0, 31.8425, 29.1, 7.7325, 3, 5.88, 1.08, 0, 0, 0, 32};
    float[] testInput = new float[input.length];
    for(int i = 0, total = input.length; i < total; i++){
      testInput[i] = new Double(input[i]).floatValue();
    }
    Booster booster = XGBoost.loadModel("${model}");
    // One row, 41 columns
    DMatrix testMat = new DMatrix(testInput, 1, 41, Float.NaN);
    float[][] predicts = booster.predict(testMat);

3. Instability caused by missing values in XGBoost on Spark source code

However, things are not so simple.

Spark ML also contains a hidden piece of missing-value handling logic: SparseVector, the sparse vector representation.

SparseVector and DenseVector are both used to represent a vector, and the only difference between them is the storage structure.

DenseVector is the ordinary vector storage: it stores every value in the vector in order.

SparseVector is a sparse representation, used to store data when the vector contains many zeros.

SparseVector is stored by simply recording all non-zero values and ignoring all zeros. In particular, one array records the positions of all non-zero values, and another array records the values corresponding to those positions. With these two arrays, plus the total length of the current vector, you can restore the original array.

Therefore, for a vector containing a large number of zeros, SparseVector can save a great deal of storage space.

An example of SparseVector storage is shown below:
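A minimal Scala sketch, with made-up values, of the same vector in both representations:

    import org.apache.spark.ml.linalg.Vectors

    // A made-up 8-element vector containing many zeros
    val dense = Vectors.dense(0.0, 3.0, 0.0, 0.0, 1.5, 0.0, 0.0, 2.0)

    // Sparse form: total length, the positions of the non-zero values, and the values themselves
    val sparse = dense.toSparse
    println(sparse)   // (8,[1,4,7],[3.0,1.5,2.0])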

As shown above, SparseVector does not store the entries whose value is 0; only the non-zero values are recorded, so the zeros effectively take up no storage at all. The following code is the implementation of VectorAssembler in Spark ML; as you can see, if a value is 0, it is not recorded in the SparseVector.

    private[feature] def assemble(vv: Any*): Vector = {
      val indices = ArrayBuilder.make[Int]
      val values = ArrayBuilder.make[Double]
      var cur = 0
      vv.foreach {
        case v: Double =>
          // 0 is not saved
          if (v != 0.0) {
            indices += cur
            values += v
          }
          cur += 1
        case vec: Vector =>
          vec.foreachActive { case (i, v) =>
            // 0 is not saved
            if (v != 0.0) {
              indices += cur + i
              values += v
            }
          }
          cur += vec.size
        case null =>
          throw new SparkException("Values to assemble cannot be null.")
        case o =>
          throw new SparkException(s"$o of type ${o.getClass.getName} is not supported.")
      }
      Vectors.sparse(cur, indices.result(), values.result()).compressed
    }

A value that occupies no storage space is, in effect, another kind of missing value. In Spark ML, SparseVector is used as the array storage format by all algorithm components, including XGBoost on Spark, and XGBoost on Spark indeed treats the 0s in a SparseVector as missing values:

    val instances: RDD[XGBLabeledPoint] = dataset.select(
      col($(featuresCol)),
      col($(labelCol)).cast(FloatType),
      baseMargin.cast(FloatType),
      weight.cast(FloatType)
    ).rdd.map { case Row(features: Vector, label: Float, baseMargin: Float, weight: Float) =>
      val (indices, values) = features match {
        // For a SparseVector, only the non-zero values are passed to XGBoost for evaluation
        case v: SparseVector => (v.indices, v.values.map(_.toFloat))
        case v: DenseVector => (null, v.values.map(_.toFloat))
      }
      XGBLabeledPoint(label, indices, values, baseMargin = baseMargin, weight = weight)
    }

Why does treating 0 in a SparseVector as a missing value introduce instability in XGBoost on Spark?

The crux is that Spark ML optimizes vector storage: based on the contents of a Vector, it automatically chooses whether to store it as a SparseVector or a DenseVector. In other words, a Vector column in Spark can hold both formats at once; within the same column of a dataset, some rows are sparse and others are dense. Which format is used is decided by the following code:

  /**
   * Returns a vector in either dense or sparse format, whichever uses less storage.
   */
  @Since("2.0.0")
  def compressed: Vector = {
    val nnz = numNonzeros
    // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
    if (1.5 * (nnz + 1.0) < size) {
      toSparse
    } else {
      toDense
    }
  }
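To make this concrete, here is a small sketch (with made-up rows) of how two rows of the same 41-column feature column can end up in different storage formats after compressed:

    import org.apache.spark.ml.linalg.Vectors

    // Two made-up rows of the same 41-column feature column
    val mostlyZeros    = Array.fill(41)(0.0).updated(3, 1.0)
    val mostlyNonZeros = Array.tabulate(41)(i => if (i < 3) 0.0 else i.toDouble)

    // compressed picks whichever representation uses less storage
    println(Vectors.dense(mostlyZeros).compressed)     // sparse: (41,[3],[1.0]) -- its zeros are dropped
    println(Vectors.dense(mostlyNonZeros).compressed)  // dense: [0.0,0.0,0.0,3.0,...,40.0] -- its zeros are kept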

In the XGBoost on Spark scenario, Float.NaN is the default missing value. If a row of the dataset is stored as a DenseVector, then when XGBoost processes that row the only missing value is Float.NaN. If the row is stored as a SparseVector, then its missing values are effectively both Float.NaN and 0, because XGBoost on Spark passes only the non-zero values of a SparseVector to XGBoost.

In other words, part of the dataset ends up with Float.NaN and 0 as missing values while the rest has only Float.NaN. The value 0 thus carries two different meanings in XGBoost on Spark depending on the underlying storage structure, which is determined entirely by the dataset itself.

Since only one missing value can be set at online serving time, choosing a test set in SparseVector format may lead to calculations during online serving that do not match the expected results.
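As a minimal sketch with made-up values, here is how the same logical row reaches XGBoost differently depending on which representation Spark happens to pick:

    import org.apache.spark.ml.linalg.Vectors

    // The same logical row, stored both ways (made-up values)
    val row = Array(1.0, 0.0, 0.0, 3.0)

    // Dense path: all four values reach XGBoost; with missing = NaN the zeros are ordinary features
    val denseValues = Vectors.dense(row).toArray.map(_.toFloat)   // Array(1.0, 0.0, 0.0, 3.0)

    // Sparse path: only the non-zero entries reach XGBoost; the zeros silently become "missing"
    val sv = Vectors.dense(row).toSparse
    val sparseValues = sv.values.map(_.toFloat)                   // Array(1.0, 3.0)
    val sparseIndices = sv.indices                                // Array(0, 3)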

Problem solving

A look at the latest XGBoost on Spark source code shows that this problem has still not been fixed.

We promptly fed this issue back to the XGBoost on Spark community, and in the meantime modified our own copy of the XGBoost on Spark code.

    val instances: RDD[XGBLabeledPoint] = dataset.select(
      col($(featuresCol)),
      col($(labelCol)).cast(FloatType),
      baseMargin.cast(FloatType),
      weight.cast(FloatType)
    ).rdd.map { case Row(features: Vector, label: Float, baseMargin: Float, weight: Float) =>
      // The return format of the original code is changed here
      val values = features match {
        // SparseVector data is converted to dense
        case v: SparseVector => v.toArray.map(_.toFloat)
        case v: DenseVector => v.values.map(_.toFloat)
      }
      XGBLabeledPoint(label, null, values, baseMargin = baseMargin, weight = weight)
    }
    /**
     * Converts a [[Vector]] to a data point with a dummy label.
     *
     * This is needed for constructing a [[ml.dmlc.xgboost4j.scala.DMatrix]]
     * for prediction.
     */
    def asXGB: XGBLabeledPoint = v match {
      case v: DenseVector =>
        XGBLabeledPoint(0.0f, null, v.values.map(_.toFloat))
      case v: SparseVector =>
        // SparseVector data is converted to dense
        XGBLabeledPoint(0.0f, null, v.toArray.map(_.toFloat))
    }

The problem was solved and the model trained with the new code improved a bit, which was a bonus.

We hope this article is helpful to anyone who runs into XGBoost missing-value problems, and we welcome discussion and feedback.

About the author

  • Zhao Jun, technical expert on the algorithm platform team of the Meituan Distribution Division.

Recruitment information

The algorithm platform team of the Meituan Distribution Division is responsible for building Turing, Meituan's one-stop large-scale machine learning platform. Covering the full life cycle of an algorithm, it lets users define model training and prediction workflows through visual drag-and-drop, provides powerful model management, online model prediction, and feature service capabilities, and offers multi-dimensional A/B splitting and online effect evaluation. The team's mission is to provide a unified, end-to-end, one-stop self-service platform for algorithm engineers, helping them reduce the complexity of algorithm development and improve the efficiency of algorithm iteration.

We are looking for senior R&D engineers, technical experts, and direction leads (machine learning platform / algorithm platform) in data engineering, data development, algorithm engineering, algorithm application, and related fields. Welcome to join us. Resumes can be sent to [email protected] (note: Meituan Distribution Division).