C++ to achieve GBDT algorithm and optimization process

It’s actually a class assignment that requires the GBDT algorithm to be implemented. In the realization of the process of reference to a lot of information, also did a lot of optimization, feel the harvest is very big, so the development process is also recorded.

Source code at GitHub.

1. Build and use

1.1 build

Windows: Use Visual Studio 2017 to open the solution and build it.
Linux: The root directory is providedmakefileFile, usingmakeCompile, yesGCC > = 5.4.0

1.2 the use of

Boost

Training data input in LibSVM format is accepted, and each line below represents a training sample:

<label> <feature-index>:<feature-value> <feature-index>:<feature-value> <feature-index>:<feature-value>
Copy the code

Data input for prediction is similar to training data:

<id> <feature-index>:<feature-value> <feature-index>:<feature-value> <feature-index>:<feature-value>
Copy the code

Currently only dichotomous problems are supported

specifies the training parameters:

eta = 1.                 # shrinkage rate
gamma = 0.               # minimum gain required to split a node
maxDepth = 6             # max depth allowed
minChildWeight = 1       # minimum allowed size for a node to be splitted
rounds = 1               # REQUIRED. number of subtrees
subsample = 1.           # subsampling ratio for each tree
colsampleByTree = 1.     # tree-wise feature subsampling ratio
maxThreads = 1;          # max running threads
features;                # REQUIRED. number of features
validateSize = .2        # if greater than 0, input data will be split into two sets and used for training and validation repectively
Copy the code

2. Algorithm principle

The core of GBDT can be divided into two parts, namely Gradient Boosting and Decision Tree:

Decision Tree: GBDT base classifier. By dividing the characteristics of the input samples, the samples with the same characteristics have roughly the same label. As the results of several different Decision trees need to be integrated in GBDT, Regression Tree is generally used instead of Classification Tree.
Gradient Boosting: An iterative integration algorithm, in which the learning objective Y of each decision tree is the residual (i.e. Gradient direction) of conclusion and sum of all previous trees, i.e 。

3. Realization and optimization process

The implementation of each part goes through several iterations of “first release implementation-performance profiling – optimization for next release code”. In the performance profiling section, Visual Studio 2017’s “performance Profiler” feature is used, which is compiled in Release mode (with the /O2 /Oi optimization option turned on) before performing performance profiles.

3.1 Data Processing

The selected input file data format is the Libsvm format as follows:

<label> <feature-index>:<feature-value> <feature-index>:<feature-value>
Copy the code

As you can see, this format is naturally suitable for representing sparse data sets, but during implementation, for simplicity and cache performance, I converted the null values to a dense matrix format by padding them with zeros. The trade-off is that the memory footprint will be much higher.

3.1.1).

No optimization was made initially, and the following simple process was used:

Files are read line by line
For each line, change tostd::stringstreamAnd then parse out the corresponding data.

The core code is as follows:

ifstream in(path);
string line;
while (getline(in, line)) {
    auto item = parseLibSVMLine(move(line), featureCount); // { label, vector }
    x.push_back(move(item.first));
    y.push_back(item.second);
}

/* in parseLibSVMLine */
stringstream ss(line);
ss >> label;
while (ss) {
    char _;
    ss >> index >> _ >> value;
    values[index - 1] = value;
}
Copy the code

The profile results:

As you can see, the main time is in the process of parsing a string into the required label + vector data, further analysis:

So the main problem is the string parsing part. At this point, it is suspected that the implementation of STD :: Stringstream sacrifices performance for thread-safety, error-checking, and so on, so consider using the implementation in CSTDIO.

3.1.2 improvement

Rewrite the implementation of parseLibSVMLine to use sscanf in cstdio instead of STD ::stringstream:

int lastp = - 1;
for (size_t p = 0; p < line.length(); p++) {
    if (isspace(line[p]) || p == line.length() - 1) {
        if (lastp == - 1) {
            sscanf(line.c_str(), "%zu", &label);
        }
        else {
            sscanf(line.c_str() + lastp, "%zu:%lf", &index, &value);
            values[index - 1] = value;
        }
        lastp = int(p + 1); }}Copy the code

The profile results:

As you can see, while the parse part is still the hot spot of computation, the amount of computation in this part is significantly reduced (53823 -> 23181), and the time to read the full data set is reduced by more than 50%.

3.1.3 the final version

Obviously, in a data set, the parsing tasks between each row are independent of each other, so the parsing of the data can be parallelized after reading the entire file at once and dividing the data by row:

string content; getline(ifstream(path), content, '\0'); stringstream in(move(content)); vector<string> lines; string line; while (getline(in, line)) lines.push_back(move(line)); #pragma omp parallel for for (int i = 0; i < lines.size(); i++) { auto item = parseLibSVMLine(move(lines[i]), featureCount); #pragma omp critical { x.push_back(move(item.first)); y.push_back(item.second); }}Copy the code

According to profile results, performance is improved by about 25% after parallelization. Peak CPU usage increased from 15% to 70%. It can be found that the performance improvement is not as high as the CPU usage increase, for the following two reasons:

The IO time to read the file, which was tested using a 672MB data set, was more than 50% of the time to read the entire content
The cost of multithreaded synchronization

3.2 Decision tree generation

The decision tree is generated in a depth-first fashion, in which the tree is continuously drawn down until one of the following termination conditions is encountered:

Reaches the maximum depth limit
The income division is smaller than the threshold
The number of samples in this object is less than the threshold

The general code is as follows:

auto p = new RegressionTree();

// calculate value for prediction
p->average = calculateAverageY();

if (x.size() > nodeThres) {
    // try to split
    auto ret = findSplitPoint(x, y, index);
    if (ret.gain > 0 && maxDepth > 1) { // check splitablity
        // split points
        // ...
        // ...

        p->left = createNode(x, y, leftIndex, maxDepth - 1);
        p->right = createNode(x, y, rightIndex, maxDepth - 1); }}Copy the code

3.2.1 Calculate dividing points

Partitioning on which values of which features is the most central (and time-consuming) part of the decision tree generation process.

The problem is described as follows:

For the data setWe need to find featuresAnd the dividing points on that feature, meet the minimum MSE (mean-square-error mean square error) :

Among them:

Is the mean value of the sample label after division.
.Is the subdata set after partition.

Equivalently, if you useRepresents revenue division:

Among them,Is MSE before division:. 。

Finding the optimal partition point is equivalent to finding the partition scheme with the highest benefits:

3.2.1.1 Sort based implementation

Analysis:

Obviously, 与 Are only related to the part and on the left (right) of the segmentation point, so the benefits of all segmentation cases can be calculated by sorting and enumerating segmentation points from small to large. For each feature, the time complexity is.

The code is as follows:

for (size_t featureIndex = 0; featureIndex < x.front().size(); featureIndex++) {
    vector<pair<size_t.double>> v(index.size());

    for (size_t i = 0; i < index.size(); i++) {
        auto ind = index[i];
        v[i].first = ind;
        v[i].second = x[ind][featureIndex];
    }

    // sorting
    tuple<size_t.double.double> tup;
    sort(v.begin(), v.end(), [](const auto &l, const auto &r) {
        return l.second < r.second;
    });

    // maintaining sums of y_i and y_i^2 in both left and right part
    double wholeErr, leftErr, rightErr;
    double wholeSum = 0, leftSum, rightSum;
    double wholePowSum = 0, leftPowSum, rightPowSum;
    for (const auto &t : v) {
        wholeSum += y[t.first];
        wholePowSum += pow(y[t.first], 2);
    }
    wholeErr = calculateError(index.size(), wholeSum, wholePowSum);

    leftSum = leftPowSum = 0;
    rightSum = wholeSum;
    rightPowSum = wholePowSum;
    for (size_t i = 0; i + 1 < index.size(); i++) {
        auto label = y[v[i].first];

        leftSum += label;
        rightSum -= label;
        leftPowSum += pow(label, 2);
        rightPowSum -= pow(label, 2);

        if (y[v[i].first] == y[v[i + 1].first]) continue; // same label with next, not splitable
        if (v[i].second == v[i + 1].second) continue; // same value, not splitable

        leftErr = calculateError(i + 1, leftSum, leftPowSum);
        rightErr = calculateError(index.size() - i - 1, rightSum, rightPowSum);

        // calculate error gain
        double gain = wholeErr - ((i + 1) * leftErr / index.size() + (index.size() - i - 1) * rightErr / index.size());
        if (gain > bestGain) {
            bestGain = gain;
            bestSplit = (v[i].second + v[i + 1].second) / 2; bestFeature = featureIndex; }}}Copy the code

The profile results:

As you can see, much of the time is spent sorting and preparing data before sorting.

3.2.1.2 Implementation based on sampling buckets

Since the previous sort-based implementation was time-consuming, a different approach was considered. Later, when I looked at the optimization scheme of LightGBM, I found a method called Sampling the Splitting Points (SS) in reference [^1]. Compared with the LightGBM scheme, THE SS method is easier to implement.

SS method is described as follows:

forSo we’re going to sample randomly from these numbersThe samples were sorted and then sampled isometricOne sample to thisAs a sampleThe partition point of a barrel. The literature suggests that if, then there is a high probability of being guaranteedThe number of samples in each bucket is closeSo it’s almost equal.

With this approach, you just need toThe time sampled outA barrel,Time to divide all samples into different buckets.

After bucket partition, we only select the bucket partition point as the candidate node partition point, so we only need to make a little change to the codeTime to find the best segmentation point. Therefore, for each feature, the time complexity of finding the optimal segmentation point is.

Boosting is a method that combines many “suboptimal” decision trees, so the loss caused by SS is acceptable. With Boosting, the loss caused by SS is an acceptable method, although the optimal partition can not be found because only the partition with the value of bucket boundary is considered.

[^1]: Ranka, Sanjay, and V. Singh. “CLOUDS: A decision tree classifier for large datasets.” Proceedings of the 4th Knowledge Discovery and Data Mining Conference. 1998.

The code is as follows (simple selection.) :

Although this is trueThe value is actually going to be, time complexityBut in the test data.It’s small enough. And ifIf you keep increasing, you can simply changeLet’s say one is not greater thanIt doesn’t matter very much.

/* in findSplitPoint */
size_t nSample = size_t(pow(num, . 5)), nBin = size_t(pow(num, 25.));
auto dividers = sampleBinsDivider(x, nSample, nBin);
vector<double> binSums(nBin, . 0), binPowSums(nBin, . 0);
vector<size_t> binSizes(nBin, 0);
for (int i = 0; i < num; i++) {
    auto value = getFeatureValue(featureIndex, i);
    auto into = decideWhichBin(dividers, value);
    auto label = y[i];
    binSums[into] += label;
    binPowSums[into] += pow(label, 2);
    binSizes[into]++;
}
Copy the code

In addition, because the data in the decidebin is very sparse, with most values of 0, adding a special judgment smaller than the first split point will reduce time by about 20% :

  size_t RegressionTree::decideWhichBin(const std: :vector<double>& divider, double value) {
    if (divider.empty() || value <= divider.front()) return 0;
    if (value > divider.back()) return divider.size();
    auto it = lower_bound(divider.cbegin(), divider.cend(), value);
    return it - divider.cbegin();
  }
Copy the code

According to the test results of the same data set and the same parameters, the iteration time of each round using SS method is reduced by about 80%.

3.2.1.3 Adding parallelism

Obviously, when looking for the best partition scheme, the task of looking for the best partition point on different features is independent of each other, so parallelism can be achieved at the feature level:

#pragma omp parallel for
for (int i = 0; i < featureIndexes.size(); i++) {
  /* sampling, bining... * /

  // for each divider
    #pragma omp critical
    if(gain > bestGain) { bestGain = gain; bestSplit = divider; bestFeature = featureIndex; }}Copy the code

With the addition of parallel optimization, the peak CPU usage is increased from 15% to 70%, and the iteration time of each round is reduced by about 60%.

3.2.2 Node Generation

Build in depth-first order until a termination condition is encountered:

auto p = new RegressionTree();
// calculate value for prediction
p->average = average(y);
if (index.size() > max<size_t> (1, config.minChildWeight)) { // if this node is big enough
    // try to split
    auto ret = findSplitPoint(xx, y, index, featureIndexes);
    if (ret.gain > config.gamma && leftDepth > 1) { // check splitablity
        /* split points ... * /
        // start splitting
        if(leftIndex.size() ! =0&& rightIndex.size() ! =0) {
            p->isLeaf = false;
            p->featureIndex = ret.featureIndex;
            p->featureValue = ret.splitPoint;
            // recursively build left and right subtrees
            p->left = createNode(leftX, leftY, config, leftDepth - 1);
            p->right = createNode(rightX, rightY, config, leftDepth - 1); }}}Copy the code

3.3 predict

For each input sample, it is continuously divided downward according to the partitioning conditions of corresponding tree nodes until leaf nodes are encountered. At this point, the average label of training samples in leaf nodes is used as the predicted value:

if (isLeaf) return average;
if (r[featureIndex] <= featureValue) return left->predict(r);
else return right->predict(r);
Copy the code

Obviously, the prediction tasks of different samples are independent from each other, so the prediction between samples can be paralleled:

Data::DataColumn result(x.size());
#pragma omp parallel for
for (int i = 0; i < x.size(); i++) {
    result[i] = predict(x[i]);
}
Copy the code

3.4 Boosting

Boosting is relatively simple, requiring only maintaining residuals after generating a new decision tree each time:

while (roundsLeft--) {
    auto subtree = RegressionTree::fit(xx, residual, config);
    auto pred = subtree->predict(x);
    pred *= config.eta; // shrinkage rate
    residual -= pred;
}
Copy the code

4. Other optimization

4.1 Performance Optimization

4.4.1 sample performance

Originally, STD ::sample: in C++17 was used to sample the segmentation points.

vector<double> samples(s);
vector<size_t> sampleIndex(s);
sample(index.begin(), index.end(), sampleIndex.begin(), s, mt19937{ random_device{}() });
for (size_t i = 0; i < s; i++) samples[i] = v[sampleIndex[i]];
Copy the code

However, from the perspective of profiling results, STD ::sample has serious efficiency issues:

Compare the use of ordinary random sampling:

vector<double> samples(s);
std::random_device rd;
auto gen = std::default_random_engine(rd());
std::uniform_int_distribution<size_t> dis(0, index.size() - 1);
for (size_t i = 0; i < s; i++) samples[i] = v[index[dis(gen)]];
Copy the code

As you can see, the time taken per round can be reduced by more than half without using STD :: SAMPLE.

4.1.2 Node Segmentation

When dividing the data of the left and right subtrees, it will require more memory operation time if the data of X and Y are directly divided. Therefore, the method selected here is as follows: X and Y are fixed, and the subscripts of samples belonging to this node are obtained by indexing:

for (size_t i = 0; i < index.size(); i++) {
    auto ind = index[i];
    if (xx[ret.featureIndex][ind] <= ret.splitPoint) {
        leftIndex.push_back(ind); // to the left
    }
    else {
        rightIndex.push_back(ind); // to the right}}Copy the code

4.2 parallelization

The implementation of parallelization relies on OpenMP and is implemented through compilation macros such as #pragma OMp Parallel.

In the implementation, parallelism is used in the following ways:

Input data processing, see 2.1.3
To find the optimal splitting point, see 2.2.1.3
Forecast, see 2.3

4.3 Cache Performance Optimization

4.3.1 X reforming

An intuitive way to store input data in LibSVM format isShape store. But throughout the algorithm, the access to data is fixed in the training processContinuous access to dimensions (i.e., reading of a particular feature of all samples), such discontinuous memory access can cause cache performance degradation. So before training, I putData were reformatted to feature firstShape, so that in the training process you only need tox[featureIndex]Continuous reads are more cache-friendly.

4.3.2 Index Sorting

As mentioned in 3.1.2, in order to reduce memory operations, the sample partitioning information is passed in the form of an index. However, it was later found that the performance was degraded. After investigation, it was found that the subsample function was added, that is, “only part of the training sample is used for each subtree for training”. To achieve this, when generating the initial index:

// generate subsample
auto sampleSize = size_t(y.size() * config.subsample);
Index index(sampleSize);
std::uniform_int_distribution<size_t> dis(0, y.size() - 1);
for (size_t i = 0; i < index.size(); i++) index[i] = dis(gen); // sample with replacement
Copy the code

The resulting index is unordered, which makes traversal reads of the form X [featureIndex][index[I]] out-of-order and cache-unfriendly. Then sort the generated index to solve:

// generate subsample
auto sampleSize = size_t(y.size() * config.subsample);
Index index(sampleSize);
std::uniform_int_distribution<size_t> dis(0, y.size() - 1);
for (size_t i = 0; i < index.size(); i++) index[i] = dis(gen); // sample with replacement
sort(index.begin(), index.end()); // for cache
Copy the code

4.3.3 Continuous Label Value

Compared with the huge X data, Y has only one column, so it is directly divided into the left and right subtrees of yLeft and yRight to further improve cache friendliness:

// during splitting
vector<size_t> leftIndex, rightIndex;
Data::DataColumn leftY, rightY;
for (size_t i = 0; i < index.size(); i++) {
    auto ind = index[i];
    if (xx[ret.featureIndex][ind] <= ret.splitPoint) {
        leftIndex.push_back(ind); // to the left
        leftY.push_back(y[i]);    // split y
    }
    else {
        rightIndex.push_back(ind); // to the right
        rightY.push_back(y[i]);    // split y}}Copy the code

5. Test comparisons

5.1 performance

Contrast this with XgBoost.

Test environment:

i7-5700HQ + 16GB

Ubuntu 16.04 (Windows Subsystem for Linux)

g++ -std=c++17 -O3 -fopenmp -m64

Training data:

train:

max-depth: 20

subsample = 95

colsample-by-tree = .93

The algorithm:
- Data reading: It takes about 22 seconds
- Training: Each round takes an average of 32s
Xgboost:
- Data reading: about 4s
- Training: the average time of each round is 40s

5.2 Memory Usage

This algorithm: because the data is stored as a dense matrix, so the consumption of memory is large. Take the training data of 672MB as an example, because the data is relatively sparse, the space occupied after reading into the memory for processing expands to 5GB.
Xgboost: the same 672MB of training data consumes about 1.4GB at runtime.

5.3 Accuracy of prediction

Since the SS method is used in this algorithm, the prediction accuracy of the same number of rounds should be lower than that of XGBoost. The simple test is as follows:

The algorithm:
Xgboost:

The simple test means that the training parameters provided to the algorithm are not tuned during the test, but the following configuration is used:

rounds = 5

features = 201

eta = .3

maxThreads = 16

gamma = 1e-4

minChildWeight = 10

maxDepth = 20

validateSize = 0

Subsample = 0.9500

ColsampleByTree = 0.9287

C++ to achieve GBDT algorithm and optimization process

1. Build and use

1.1 build

1.2 the use of

2. Algorithm principle

3. Realization and optimization process

3.1 Data Processing

3.1.1).

3.1.2 improvement

3.1.3 the final version

3.2 Decision tree generation

3.2.1 Calculate dividing points

3.2.1.1 Sort based implementation

3.2.1.2 Implementation based on sampling buckets

3.2.1.3 Adding parallelism

3.2.2 Node Generation

3.3 predict

3.4 Boosting

4. Other optimization

4.1 Performance Optimization

4.4.1 sample performance

4.1.2 Node Segmentation

4.2 parallelization

4.3 Cache Performance Optimization

4.3.1 X reforming

4.3.2 Index Sorting

4.3.3 Continuous Label Value

5. Test comparisons

5.1 performance

5.2 Memory Usage

5.3 Accuracy of prediction

Related Posts

· Fully illustrated RNN, RNN variants, Seq2Seq, Attention mechanism

Use VSCode to develop Python

[Ali DIN] Deep interest in network source analysis of the overall code structure