The previous article introduced XGBoost, a gradient boosting decision tree (GBDT) model. In this article we look at another evolution of GBDT: LightGBM. LightGBM is a newer member of the boosting family, released by Microsoft. Like XGBoost, LightGBM is an efficient implementation of GBDT: it fits each new decision tree to the negative gradient of the loss function, used as an approximation of the residual of the current model.

LightGBM outperforms XGBoost in many respects. Its main advantages are:

  • Faster training
  • Lower memory usage
  • Higher accuracy
  • Support for parallel learning
  • Ability to handle large-scale data
  • Direct support for categorical features

Experimental comparisons show that LightGBM is nearly 10 times faster than XGBoost, uses roughly one sixth of the memory, and achieves better accuracy.

These results raise two questions: why build a faster, less memory-hungry model when XGBoost already works so well, and which technical improvements to the GBDT algorithm make this possible?

The motivation for LightGBM

Common machine learning algorithms such as neural networks can be trained in mini-batches, so the size of the training data is not limited by memory. GBDT, however, must traverse the entire training set many times in each iteration. Loading the whole training set into memory limits how much data can be used; keeping it on disk and repeatedly reading and writing it costs an enormous amount of time. Faced with industrial-scale data, the ordinary GBDT algorithm cannot meet these needs.

LightGBM was proposed mainly to solve the problems GBDT runs into on massive data, so that GBDT can be used better and faster in industrial practice.

XGBoost’s pros and cons

Exact greedy algorithm

Each iteration traverses the entire training set several times. Loading the whole training set into memory limits its size; not loading it means repeatedly reading and writing the data, which is very time-consuming.

Advantages:

  • You can find the exact split points

Disadvantages:

  • Huge amount of computation
  • Huge memory footprint
  • It is easy to overfit

Level-wise iteration

Pre-sorting: first, it consumes a lot of space. The algorithm must keep the feature values of the data as well as the result of sorting each feature (e.g. the sorted indices, for quickly locating split points later), which costs roughly twice the memory of the training data. Second, the time cost is also large: the split gain has to be computed for every candidate split point traversed, which is expensive.

Advantages:

  • You can use multiple threads
  • Can speed up the exact greedy algorithm

Disadvantages:

  • Inefficient and may produce unnecessary leaves

Not cache-friendly

After pre-sorting, gradient access by feature becomes random, and different features are visited in different orders, so the CPU cache cannot be used effectively. In addition, at each level of the tree a random-access array mapping row index to leaf index must be consulted, again in a different order for each feature, which also causes many cache misses.

Where is LightGBM optimized?

These are not so much weaknesses of XGBoost as the points LightGBM’s authors focused on when building the new algorithm: whatever problems the original model left unsolved are exactly what the new model sets out to fix.

In summary, LightGBM has the following features:

  • Decision tree algorithm based on Histogram
  • Leaf-wise growth strategy with a depth limit
  • Histogram difference acceleration
  • Direct support for categorical features
  • Cache hit ratio optimization
  • Sparse feature optimization based on histogram
  • Multithreading optimization

Decision tree algorithm

XGBoost uses the pre-sorted algorithm, which finds split points very precisely. The steps are:

  • First, all features are pre-sorted by their values.
  • Then, at each split, the optimal split point of each feature is found by scanning it, at a cost of O(#data).
  • Finally, the best feature and split point are chosen and the data is divided into left and right child nodes.

This pre-sorted algorithm finds split points accurately, but it is expensive in both space and time:

  • Because the features must be pre-sorted and the sorted indices saved (to locate split points quickly later), it needs roughly twice the memory of the training data.
  • Every candidate split point traversed requires computing a split gain, which is costly (a sketch of this split search follows the list).
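
For reference, the sketch below illustrates this pre-sorted exact greedy idea on a single feature. It is a simplified illustration, not XGBoost's actual implementation: the gradient/hessian inputs and the second-order gain formula with the regularization term `lambda_reg` are the usual textbook form, and all names are ours.

```python
import numpy as np

def exact_greedy_split(feature, grad, hess, lambda_reg=1.0):
    """Scan every candidate threshold of one feature after pre-sorting it.

    Simplified sketch: sort once, then sweep prefix sums of gradients and
    hessians to score each split with the usual second-order gain form.
    """
    feature, grad, hess = map(np.asarray, (feature, grad, hess))
    order = np.argsort(feature)              # the pre-sorting step
    g, h = grad[order], hess[order]
    G, H = g.sum(), h.sum()

    best_gain, best_threshold = 0.0, None
    G_left = H_left = 0.0
    for i in range(len(feature) - 1):        # O(#data) scan per feature
        G_left += g[i]
        H_left += h[i]
        G_right, H_right = G - G_left, H - H_left
        gain = (G_left**2 / (H_left + lambda_reg)
                + G_right**2 / (H_right + lambda_reg)
                - G**2 / (H + lambda_reg))
        if gain > best_gain:
            best_gain = gain
            best_threshold = (feature[order[i]] + feature[order[i + 1]]) / 2
    return best_threshold, best_gain
```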

LightGBM instead uses the histogram algorithm, which needs less memory and less computation to find splits. The idea is to discretize continuous floating-point feature values into k discrete bins and build a histogram of width k. The training data is then traversed once to accumulate statistics for each bin. To select a split, only the k discrete bins of the histogram need to be scanned for the optimal split point.

The histogram algorithm has several advantages. The most obvious is lower memory consumption: it does not need to store pre-sorting results, only the discretized bin value of each feature, which usually fits in an 8-bit integer, reducing memory consumption to about 1/8 of the pre-sorted approach.

The computation cost also drops sharply. The pre-sorted algorithm computes a split gain for every feature value it traverses, while the histogram algorithm only computes k gains per feature (k can be treated as a constant), so the complexity of split finding goes from O(#data × #features) to O(k × #features).

Histogram algorithm

The idea of the histogram algorithm is very simple: first convert continuous floating-point values into discrete bins, then accumulate statistics per bin as a histogram. (It sounds fancy, but it is just histogram statistics: large amounts of raw data get bucketed into histograms.)

The histogram algorithm has a few caveats:

  • Using bins instead of raw values acts as extra regularization;
  • Using bins discards many details of the data: similar values may fall into the same bucket, so the differences between them disappear;
  • The number of bins controls the degree of regularization: fewer bins mean a stronger penalty and a higher risk of underfitting.

Other points to note about the histogram algorithm:

  • Building the histograms does not require sorting the data (unlike XGBoost's pre-sorting), because the bin boundaries are fixed in advance;
  • Besides the split thresholds and the sample count of each bin, the histogram also stores the sum of first-order gradients of all samples in the current bin (the squared gradient sum, averaged, is equivalent to the squared-error loss);
  • Split thresholds are selected by scanning the histogram bins from small to large, using the accumulated gradient sums to find the feature and threshold that maximize ΔLoss after the split; a minimal sketch of this build-and-scan procedure follows the list.
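
The sketch below illustrates the build-and-scan procedure described in this list: bin one feature, accumulate per-bin gradient statistics, then scan the k bins for the best threshold. It is a minimal illustration under a fixed quantile binning scheme, not LightGBM's implementation; all function and parameter names are ours.

```python
import numpy as np

def build_histogram(feature, grad, hess, n_bins=255):
    """Discretize one feature into bins and accumulate per-bin statistics (O(#data))."""
    feature, grad, hess = map(np.asarray, (feature, grad, hess))
    # Bin edges are fixed up front (here: quantiles), so no per-split sorting is needed.
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, feature)                 # bin index per sample
    grad_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    hess_hist = np.bincount(bins, weights=hess, minlength=n_bins)
    count_hist = np.bincount(bins, minlength=n_bins)
    return grad_hist, hess_hist, count_hist

def best_split_from_histogram(grad_hist, hess_hist, lambda_reg=1.0):
    """Scan the k bins (not the #data samples) to pick the best threshold bin (O(k))."""
    G, H = grad_hist.sum(), hess_hist.sum()
    best_gain, best_bin = 0.0, None
    G_left = H_left = 0.0
    for b in range(len(grad_hist) - 1):
        G_left += grad_hist[b]
        H_left += hess_hist[b]
        gain = (G_left**2 / (H_left + lambda_reg)
                + (G - G_left)**2 / (H - H_left + lambda_reg)
                - G**2 / (H + lambda_reg))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```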

Advantages and disadvantages of Histogram algorithm:

  • The histogram algorithm is not perfect. After discretization the split points are no longer exact, which can affect the result. On real data sets, however, the impact of these coarser split points on final accuracy is small, and the result is sometimes even better. The reason is that each decision tree is a weak learner anyway, and the coarser split points act as regularization, effectively preventing overfitting.
  • Time overhead drops from O(#data × #features) to O(#bins × #features). Since #bins is much smaller than #data after discretization, this is a significant speed-up.

The histogram algorithm can be accelerated further. The histogram of a leaf can be obtained directly as the difference between its parent's histogram and its sibling's histogram. Normally, building a histogram requires traversing all the data on the leaf; with histogram subtraction, only the k bins of the histogram need to be traversed, roughly doubling the speed.

Decision tree growth strategy

Based on the Histogram algorithm, LightGBM is further optimized. First, it abandons the Level-wise decision tree growth strategy used by most GBDT tools in favor of the leaf-wise algorithm with depth constraints.

XGBoost adopts a level-wise growth strategy, which splits all leaves of the same level at once; this makes multithreading easy and is less prone to overfitting. But treating every leaf of a level indiscriminately brings a lot of unnecessary overhead, because many leaves have low split gain and do not really need to be searched and split.

LightGBM uses a leaf-wise strategy: at each step it finds, among all current leaves, the leaf with the largest split gain (usually also the one with the most data), splits it, and repeats. Compared with level-wise growth, leaf-wise growth achieves lower error and better accuracy for the same number of splits. Its drawback is that it may grow very deep trees and overfit, so LightGBM adds a maximum depth limit on top of leaf-wise growth to keep efficiency high while preventing overfitting.
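
A minimal sketch of the leaf-wise growth loop with a depth cap is shown below. It assumes a caller-provided find_best_split(leaf) routine and a leaf object with a writable children attribute; both are illustrative stand-ins, not LightGBM internals.

```python
import heapq

def grow_leaf_wise(root_leaf, find_best_split, max_leaves=31, max_depth=6):
    """Grow a tree by always splitting the current leaf with the largest gain.

    `find_best_split(leaf)` is assumed to return (gain, left_leaf, right_leaf)
    or None; the depth cap mirrors the max-depth safeguard against the deep
    trees leaf-wise growth can otherwise produce.
    """
    heap = []                                    # max-heap via negated gain
    counter = 0                                  # tie-breaker so leaves never get compared

    def push(leaf, depth):
        nonlocal counter
        if depth >= max_depth:                   # depth limit: do not consider this leaf
            return
        split = find_best_split(leaf)
        if split is not None:
            gain, left, right = split
            heapq.heappush(heap, (-gain, counter, leaf, left, right, depth))
            counter += 1

    push(root_leaf, 0)
    n_leaves = 1
    while heap and n_leaves < max_leaves:
        _, _, leaf, left, right, depth = heapq.heappop(heap)
        leaf.children = (left, right)            # perform the highest-gain split
        n_leaves += 1
        push(left, depth + 1)
        push(right, depth + 1)
    return root_leaf
```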

Histogram difference acceleration

Another optimization in LightGBM is histogram subtraction. An easy observation is that a leaf's histogram can be obtained by subtracting its sibling's histogram from its parent's. Normally, building a histogram requires traversing all the data on the leaf, whereas the subtraction only touches the k bins of the histogram. With this trick, after building the histogram of one child, LightGBM gets its sibling's histogram at a tiny cost, doubling the speed.
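
In code terms, the subtraction is just an element-wise difference of the per-bin statistics, as in this small sketch (reusing the (gradient, hessian, count) histograms from the earlier sketch; names are ours):

```python
import numpy as np

def sibling_histogram(parent_hist, left_hist):
    """Histogram subtraction: the right child's histogram equals the parent's
    minus the left child's, so only the smaller child needs a full rebuild."""
    return tuple(np.asarray(p) - np.asarray(l) for p, l in zip(parent_hist, left_hist))

# Usage with (grad_hist, hess_hist, count_hist) tuples from build_histogram:
# right_hist = sibling_histogram(parent_hist, left_hist)
```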

Direct support for categorical features

In fact, most machine learning tools cannot handle categorical features directly; such features usually have to be one-hot encoded first, which hurts both space and time efficiency. Since categorical features are common in practice, LightGBM optimizes for them: they can be fed in directly, with no 0/1 expansion, and the decision rules for categorical features are built into the tree-learning algorithm.

One-hot encoding is the usual way to handle categorical features, but it may not be a good choice in tree models, especially when a feature has many categories. The main problems are:

  • The tree may never split on that feature (the feature is effectively wasted). With one-hot encoding, each decision node can only use a one-vs-rest split (dog vs. the rest, cat vs. the rest, etc.). When there are many category values, each individual category has little data, so the split is very unbalanced and its gain is small (intuitively, an extremely unbalanced split is hardly better than no split at all).
  • It hurts the learning of the decision tree. Even when such a split is made, the data is shattered into many small fragments of the feature space, as shown on the left of Figure 1. Decision tree learning relies on statistics, and statistics computed on very little data are inaccurate, which degrades learning. If instead the split on the right is used, the data is divided into two larger regions and learning proceeds better.

The split on the right means: samples with X=A or X=C go to the left child, and the rest go to the right child.

How LightGBM handles categorical features:

To overcome the shortcomings of one-hot encoding, LightGBM uses many-vs-many splits to find the optimal partition of a categorical feature, which is what produces the split shown on the right. Categorical features can be fed in directly. For a feature with k categories, naively enumerating all partitions costs O(2^k); LightGBM instead uses the method from "On Grouping For Maximum Homogeneity" to find the optimal partition in roughly O(k log k).

The procedure is as follows: before enumerating split points, the histogram bins are sorted by the mean target value of each category (Sum(y)/Count(y)), and the optimal split is then enumerated over this sorted order. This approach is prone to overfitting, so LightGBM adds a number of constraints and regularization terms to it.
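
The sketch below illustrates this sort-then-scan idea for one categorical feature, using the gradient-sum-over-hessian-sum statistic as a stand-in for the per-category mean. It only illustrates the many-vs-many principle; the cat_smooth name mirrors a LightGBM parameter, but the logic here is ours, not LightGBM's actual code.

```python
import numpy as np

def best_categorical_split(cat_values, grad, hess, lambda_reg=1.0, cat_smooth=10.0):
    """Sort categories by sum_grad / (sum_hess + cat_smooth), then scan the
    sorted order like a numerical feature to find a many-vs-many split."""
    cat_values, grad, hess = map(np.asarray, (cat_values, grad, hess))
    cats = np.unique(cat_values)
    sum_g = np.array([grad[cat_values == c].sum() for c in cats])
    sum_h = np.array([hess[cat_values == c].sum() for c in cats])

    order = np.argsort(sum_g / (sum_h + cat_smooth))   # sort bins by their "mean"
    G, H = sum_g.sum(), sum_h.sum()

    best_gain, best_left = 0.0, None
    G_left = H_left = 0.0
    for i in range(len(cats) - 1):
        G_left += sum_g[order[i]]
        H_left += sum_h[order[i]]
        gain = (G_left**2 / (H_left + lambda_reg)
                + (G - G_left)**2 / (H - H_left + lambda_reg)
                - G**2 / (H + lambda_reg))
        if gain > best_gain:
            best_gain = gain
            best_left = set(cats[order[:i + 1]])       # categories sent to the left child
    return best_left, best_gain
```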

A simple comparison experiment shows that this optimal-split method improves AUC by 1.5 points while costing only 20% more training time.

The following is how the optimal split for a categorical feature is found in the code:

  • Building the histogram for a discrete (categorical) feature: count the occurrences of each distinct value of the feature, sort the values from most to least frequent, and filter out values that occur too rarely; a bin container is then created for each remaining value, while values that occur too rarely are simply dropped and get no bin container.
  • Procedure for calculating the splitting threshold:
    • First check how many bin containers the feature has. If there are fewer than 4, scan each bin container one by one in one-vs-other fashion to find the optimal split point.
    • When there are many bin containers, filtering is applied first: only bin containers with sufficiently large sample subsets take part in the split-threshold computation. For each qualifying bin container a score is computed as (sum of first-order gradients of the samples in the bin) / (sum of second-order gradients of the samples in the bin + the regularization term cat_smooth). (The earlier example, with first derivative y and second derivative 1, only covers learning a single tree for a regression problem and is meant for intuition.) The bin containers are sorted by this score from small to large, and the optimal split threshold is then searched from left to right and from right to left. Not all bin containers are searched: an upper limit of 32, the max_num_cat parameter, caps how many are considered. In this way LightGBM implements the many-vs-many strategy for discrete features: the bin containers on one side of the optimal threshold form one "many" set, and all the remaining bin containers form the other.
    • A continuous feature has a single split threshold, but a discrete feature may be split by a set of thresholds, each corresponding to a bin container number. When splitting on a discrete feature, a sample goes to the left subtree if its bin container number is in this threshold set, and to the right subtree otherwise.

Direct support for efficient parallelism

LightGBM natively supports parallel learning, currently both feature parallelism and data parallelism. The main idea of feature parallelism is that different machines find the optimal split point on different subsets of features and then synchronize the best split among machines. In data parallelism, each machine constructs histograms locally, the histograms are merged globally, and the optimal split point is found on the merged histogram.

LightGBM optimizes both parallel methods. In the feature-parallel algorithm, every machine keeps all the data locally, which avoids communicating the data-split results. In data parallelism, Reduce Scatter distributes the histogram-merging work across machines, reducing both communication and computation, and histogram subtraction is used here as well, cutting the traffic in half again.

Voting-based data parallelism further reduces the communication cost of data parallelism, making it effectively constant. With very large amounts of data, voting parallelism achieves a very good speed-up.
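
In practice, the parallel mode is chosen through the tree_learner parameter. The snippet below only shows where that switch lives; the distributed-run details (number of machines, port) are deployment-specific placeholders.

```python
# Illustrative distributed-training configuration (sketch). The machine count
# and port below are placeholders for your own deployment.
params = {
    'objective': 'regression',
    'tree_learner': 'data',        # 'serial', 'feature', 'data', or 'voting'
    'num_machines': 2,             # placeholder: size of your cluster
    'local_listen_port': 12400,    # placeholder: port used for machine communication
}
```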

 

For more details, please refer to the NIPS 2016 paper: A Communication-Efficient Parallel Algorithm for Decision Tree

Network communication optimization

Because XGBoost's pre-sorted algorithm would make communication very expensive, its parallel version also falls back to a histogram-style algorithm. Since the histogram algorithm used by LightGBM has a small communication cost, near-linear speed-up of parallel training can be achieved with collective communication algorithms.

LightGBM principle

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

A boosting tree learns an additive model with the forward stagewise algorithm. There are several efficient implementations of GBDT (Gradient Boosting Decision Tree), such as XGBoost and pGBRT. GBDT uses the negative gradient as the split indicator (information gain), while XGBoost also uses the second derivative. Their common shortcoming is that computing the information gain requires scanning all samples to find the optimal split point, so their efficiency and scalability are unsatisfactory with large amounts of data or high feature dimensionality. A direct way to address this is to reduce the number of features and the amount of data without hurting accuracy. Some prior work accelerates boosting by sampling according to sample weights, but this cannot be applied to GBDT because GBDT has no sample weights.

Microsoft's open-source LightGBM (GBDT-based) addresses these problems well. It mainly contains two new algorithms:

Gradient-based One-side Sampling (GOSS)

GOSS (reducing the number of samples): exclude most of the samples with small gradients and compute the information gain with only the remaining samples. Although GBDT has no sample weights, each data instance has a different gradient, and by the definition of information gain, instances with larger gradients contribute more to it. So when downsampling we should keep the samples with large gradients (above a preset threshold, or in the top percentiles) and randomly drop samples with small gradients. The authors prove that, at the same sampling rate, this yields more accurate results than uniform random sampling, especially when the range of information gain values is large.

Exclusive Feature Bundling (EFB)

  • EFB (reducing the number of features): bundle mutually exclusive features, i.e., features that rarely take non-zero values at the same time, replacing them with a single composite feature. In real applications the number of features is often large but the feature space is very sparse, which suggests a lossless way to reduce the number of effective features. In a sparse feature space many features are almost mutually exclusive (for example, one-hot encoded features never take non-zero values simultaneously), so they can be bundled together. The bundling problem reduces to graph coloring, and an approximate solution is obtained with a greedy algorithm.

The GBDT algorithm combining GOSS and EFB is LightGBM.

Gradient-based One-side Sampling (GOSS)

GOSS strikes a balance between reducing the amount of data and preserving accuracy: it distinguishes instances by their gradients, keeping those with large gradients and randomly sampling those with small gradients, thereby reducing computation and improving efficiency.

Algorithm description

In AdaBoost, the sample weight indicates how important a data instance is. GBDT has no native sample weights, so weighted sampling cannot be applied directly. Fortunately, each instance in GBDT has a different gradient, which is very useful for sampling: if an instance's gradient is small, its training error is small and it has already been learned well. The naive idea is simply to discard data with small gradients, but that would change the data distribution and hurt the accuracy of the trained model. GOSS was proposed to avoid this problem.

GOSS keeps all instances with large gradients and randomly samples the instances with small gradients. To compensate for the change in the data distribution, GOSS multiplies the small-gradient samples by a constant when computing the information gain. Concretely, GOSS first sorts instances by the absolute value of their gradients and keeps the top a×100% as subset A; it then randomly samples b×100% of the remaining data as subset B. When computing the information gain, the statistics of the sampled small-gradient data are multiplied by (1−a)/b, so the algorithm focuses on under-trained instances without changing the original data distribution too much.
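
A minimal sketch of this sampling step is shown below, assuming plain NumPy arrays of per-instance gradients; the function name and defaults are ours, not LightGBM's API.

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, rng=None):
    """Gradient-based One-Side Sampling (sketch).

    Keep the top a*100% of instances by |gradient|, randomly sample b*100%
    of the rest, and return indices plus weights; the small-gradient samples
    are up-weighted by (1 - a) / b so gradient statistics stay roughly unbiased.
    """
    rng = np.random.default_rng(rng)
    grad = np.asarray(grad)
    n = len(grad)
    top_k = int(a * n)
    rand_k = int(b * n)

    order = np.argsort(-np.abs(grad))        # sort by |gradient|, descending
    top_idx = order[:top_k]                  # subset A: large gradients, always kept
    rest = order[top_k:]
    rand_idx = rng.choice(rest, size=rand_k, replace=False)   # subset B

    used_idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(used_idx))
    weights[top_k:] = (1.0 - a) / b          # amplify the sampled small gradients
    return used_idx, weights
```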

Theoretical analysis

GBDT uses decision trees to learn a function that maps the input space to the gradient space. Suppose the training set has n instances, each with s features. In each boosting iteration, the negative gradients of the loss function with respect to the model outputs are denoted $g_1, \dots, g_n$; the decision tree routes the data to its nodes through the split point with the maximum information gain, and GBDT measures the information gain by the variance after splitting.

Definition: let $O$ be the training data on a fixed node of the decision tree. The variance gain of splitting feature $j$ at point $d$ for this node is defined as

$$V_{j|O}(d) = \frac{1}{n_O}\left( \frac{\left(\sum_{x_i \in O:\, x_{ij} \le d} g_i\right)^2}{n^j_{l|O}(d)} + \frac{\left(\sum_{x_i \in O:\, x_{ij} > d} g_i\right)^2}{n^j_{r|O}(d)} \right)$$

where $n_O = \sum I[x_i \in O]$, $n^j_{l|O}(d) = \sum I[x_i \in O : x_{ij} \le d]$, and $n^j_{r|O}(d) = \sum I[x_i \in O : x_{ij} > d]$.

The algorithm traverses every split point of every feature, finds the split $d_j^* = \arg\max_d V_j(d)$ with the largest information gain $V_j(d_j^*)$, and then divides the data into left and right child nodes according to feature $j^*$ and split point $d_{j^*}^*$.

In GOSS:

  • First, the training instances are sorted in descending order by the absolute value of their gradients.
  • The top a×100% of the instances are kept as subset A.
  • From the remaining instances A^c, a subset B of size b×|A^c| is drawn by random sampling.
  • Finally, the information gain is estimated with the following equation:

$$\tilde{V}_j(d) = \frac{1}{n}\left( \frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n_l^j(d)} + \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n_r^j(d)} \right) \qquad (1)$$

where $A_l = \{x_i \in A : x_{ij} \le d\}$, $A_r = \{x_i \in A : x_{ij} > d\}$, and $B_l$, $B_r$ are defined analogously on $B$.

GOSS thus estimates the information gain on the much smaller set $A \cup B$, which greatly reduces the computation. More importantly, the theory below shows that GOSS loses little training accuracy compared with random sampling; the proof is given in the paper's supplementary material.

We define the GOSS approximation error as $\varepsilon(d) = |\tilde{V}_j(d) - V_j(d)|$ and let

$$\bar{g}_l^j(d) = \frac{\left|\sum_{x_i \in (A \cup A^c)_l} g_i\right|}{n_l^j(d)}, \qquad \bar{g}_r^j(d) = \frac{\left|\sum_{x_i \in (A \cup A^c)_r} g_i\right|}{n_r^j(d)}.$$

Then, with probability at least $1-\delta$,

$$\varepsilon(d) \le C_{a,b}^2 \ln\frac{1}{\delta} \cdot \max\left\{\frac{1}{n_l^j(d)}, \frac{1}{n_r^j(d)}\right\} + 2\, D\, C_{a,b} \sqrt{\frac{\ln(1/\delta)}{n}} \qquad (2)$$

where $C_{a,b} = \frac{1-a}{\sqrt{b}} \max_{x_i \in A^c} |g_i|$ and $D = \max\left(\bar{g}_l^j(d), \bar{g}_r^j(d)\right)$.

From this theorem we can draw the following conclusions:

  • GOSS is asymptotically consistent. If the split is not highly unbalanced (i.e., $n_l^j(d) \ge O(\sqrt{n})$ and $n_r^j(d) \ge O(\sqrt{n})$), the approximation error in inequality (2) is dominated by its second term, which tends to 0 as n goes to infinity. In other words, the larger the amount of data, the smaller the error and the higher the accuracy.
  • Random sampling is the special case of GOSS with a = 0. In most cases GOSS outperforms random sampling, namely when $C_{0,\beta} > C_{a,\beta-a}$, i.e. $\frac{\alpha_a}{\sqrt{\beta}} > \frac{1-a}{\sqrt{\beta-a}}$, where $\alpha_a = \max_{x_i \in A \cup A^c} |g_i| \,/\, \max_{x_i \in A^c} |g_i|$.

The generalization ability of GOSS is analyzed as follows. Consider the GOSS generalization error $\varepsilon_{gen}^{GOSS}(d) = |\tilde{V}_j(d) - V_*(d)|$, the gap between the variance gain computed on the GOSS sample and the true variance gain on the underlying distribution. It can be decomposed as $\varepsilon_{gen}^{GOSS}(d) \le \varepsilon_{GOSS}(d) + \varepsilon_{gen}(d)$, so when the GOSS approximation is accurate, the GOSS generalization error is close to that obtained with the full data. On the other hand, sampling increases the diversity of the base learners (each sample may differ), which can itself improve generalization.

Exclusive Feature Bundling (EFB)

EFB reduces the feature dimension (it is effectively a dimensionality-reduction technique) by bundling features, which improves computational efficiency. Typically, the bundled features are mutually exclusive (one is zero whenever the other is non-zero), so bundling loses no information. If two features are not completely mutually exclusive (sometimes both are non-zero), the degree of non-exclusiveness can be measured by a conflict ratio; when this ratio is small, two not-completely-exclusive features can still be bundled without noticeably affecting the final accuracy.

The EFB algorithm proceeds as follows:

  • Sort features by the number of non-zero values
  • Calculate the conflict ratio between different features
  • Iterate over each feature and try to merge features to minimize the collision rate

High-dimensional data is usually sparse, and this sparsity inspires a lossless way to reduce the feature dimension. In particular, in a sparse feature space many features are mutually exclusive, i.e., they are never non-zero at the same time. We can bind mutually exclusive features into a single feature and, with a carefully designed feature-scanning algorithm, build exactly the same feature histograms from the bundles as from the individual features. In this way the complexity of histogram construction drops from O(#data × #feature) to O(#data × #bundle); since #bundle << #feature, this greatly speeds up GBDT training without losing accuracy.

There are two questions:

  • How do you decide which features should be bundled together?
  • How to merge a feature?

Bundling (which features should be bundled together)?

**Theorem 1:** Partitioning features into the smallest number of mutually exclusive bundles is NP-hard.

Proof: the graph coloring problem can be reduced to this problem, and graph coloring is NP-hard, so this problem is NP-hard as well.

Given a graph coloring instance G = (V, E), take each row of G's incidence matrix as a feature; the resulting instance of our problem has |V| features. It is easy to see that an exclusive feature bundle in our problem corresponds to a set of vertices with the same color, and vice versa.

Theorem 1 says this NP-hard problem cannot be solved in polynomial time, so we look for a good approximation: reduce the optimal bundling problem to graph coloring by connecting two features with an edge whenever they are not mutually exclusive, and then use a reasonable greedy coloring algorithm (with a constant approximation ratio) to produce the feature bundles. We further note that many features, while not 100% mutually exclusive, rarely take non-zero values at the same time. If the algorithm tolerates a small number of conflicts, we obtain even fewer bundles and further improve efficiency. A simple calculation shows that randomly polluting a small fraction of feature values affects training accuracy by at most $O([(1-\gamma)n]^{-2/3})$, where $\gamma$ is the maximum conflict ratio within a bundle; when $\gamma$ is relatively small, this strikes a balance between accuracy and efficiency.

**Algorithm 3:** Based on the above discussion, Algorithm 3 is designed (pseudo-code in the referenced figure). Concretely:

  • Build a graph in which each vertex represents a feature and each edge carries a weight related to the overall conflict between the two connected features.
  • Sort the features by their degree in descending order.
  • Visit each feature in this order and either add it to an existing bundle with a small conflict or create a new bundle, so that the overall conflict after the operation is minimized.

The time complexity of Algorithm 3 is O(#feature²), and it runs only once before training, so this cost is acceptable when the number of features is moderate, but it struggles with millions of features. To improve efficiency further, a more efficient graph-free sorting strategy is proposed: sort features by their number of non-zero values, which is similar to sorting graph nodes by degree, since more non-zero values usually mean more conflicts. Only the sorting strategy of Algorithm 3 changes.
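
The sketch below illustrates the greedy bundling idea in its graph-free form: visit features in descending order of non-zero count and add each one to the first bundle whose accumulated conflict stays under a budget. The conflict budget and the dense-matrix representation are simplifying assumptions for illustration, not LightGBM's actual implementation.

```python
import numpy as np

def greedy_feature_bundles(X, max_conflict_rate=0.01):
    """Greedily bundle sparse features whose non-zero rows rarely overlap.

    Features are visited in descending order of non-zero count (the graph-free
    variant described above); a feature joins the first bundle whose conflict
    with it stays under the budget, otherwise it starts a new bundle.
    """
    n_samples, n_features = X.shape
    max_conflicts = int(max_conflict_rate * n_samples)
    nonzero = [np.flatnonzero(X[:, j]) for j in range(n_features)]
    order = np.argsort([-len(nz) for nz in nonzero])

    bundles, bundle_rows = [], []            # feature indices / union of non-zero rows
    for j in order:
        rows_j = set(nonzero[j])
        for k, rows in enumerate(bundle_rows):
            if len(rows & rows_j) <= max_conflicts:   # small conflict: join this bundle
                bundles[k].append(j)
                bundle_rows[k] = rows | rows_j
                break
        else:                                          # no compatible bundle: start a new one
            bundles.append([j])
            bundle_rows.append(rows_j)
    return bundles
```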

Merging features

How do we merge the features of one bundle so as to reduce training complexity? The key is that the original feature values must remain distinguishable inside the bundle. Since the histogram algorithm stores discrete bins rather than continuous values, we can build a bundle by placing mutually exclusive features in different ranges of bins, which is done by adding offsets to the original values. For example, suppose a bundle contains two features: feature A takes values in [0, 10) and feature B in [0, 20). Adding an offset of 10 to B shifts its range to [10, 30). A and B can then be safely merged into a single feature with range [0, 30) that replaces them both. The details are given in Algorithm 4.
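
A small sketch of this offset-based merging, assuming the features already hold integer bin indices and that zero marks "feature not present" (the function and argument names are ours):

```python
import numpy as np

def merge_exclusive_features(X, bundle, bin_counts):
    """Merge one bundle of (mostly) mutually exclusive discrete features into a
    single feature by adding per-feature offsets, as in the A/B example above.

    `bundle` lists the column indices in the bundle and `bin_counts[j]` is the
    number of discrete bin values feature j can take; zeros stay zero so
    exclusivity is preserved.
    """
    merged = np.zeros(X.shape[0], dtype=np.int64)
    offset = 0
    for j in bundle:
        nonzero = X[:, j] != 0
        merged[nonzero] = X[nonzero, j] + offset   # shift feature j's bins by the offset
        offset += bin_counts[j]
    return merged
```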

The EFB algorithm turns many mutually exclusive features into far fewer dense features, which effectively avoids unnecessary computation on zero values. The basic histogram algorithm can also be optimized to ignore zero values by keeping, for each feature, a table of its non-zero entries; scanning this table reduces the complexity of building a histogram from O(#data) to O(#non_zero_data). This approach does require extra memory and bookkeeping to maintain the per-feature tables during tree building. LightGBM includes this optimization as a basic function, since it does not conflict with EFB (it can still be applied when the bundles are sparse).

Reference links: blog.csdn.net/shine199308…

How do I use LightGBM

For the meaning of all parameters, see: http://lightgbm.apachecn.org/#/docs/6

Parameter adjustment process:

  • num_leaves. LightGBM grows trees leaf-wise, so num_leaves rather than max_depth is the main way to control tree complexity. A rough conversion is num_leaves ≈ 2^max_depth, though in practice num_leaves should be set well below that bound.
  • params['is_unbalance'] = 'true' for unbalanced binary classification.
  • Bagging parameters: bagging_fraction together with bagging_freq (bagging_fraction only takes effect when bagging_freq is also set), and feature_fraction. bagging_fraction makes bagging run faster, and feature_fraction sets the fraction of features used in each iteration.
  • min_data_in_leaf and min_sum_hessian_in_leaf: increase them to prevent overfitting; they are usually set to fairly large values (a combined example follows this list).
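
As a hedged starting point that puts these knobs together (the concrete numbers are illustrative values to tune from, not recommendations):

```python
# Illustrative starting values only; tune per dataset.
params = {
    'objective': 'binary',
    'num_leaves': 63,                 # main complexity control (leaf-wise growth)
    'max_depth': 7,                   # roughly num_leaves <= 2**max_depth
    'learning_rate': 0.05,
    'is_unbalance': True,             # or set scale_pos_weight instead
    'bagging_fraction': 0.8,
    'bagging_freq': 5,                # bagging_fraction only takes effect with bagging_freq > 0
    'feature_fraction': 0.8,
    'min_data_in_leaf': 50,           # raise to fight overfitting
    'min_sum_hessian_in_leaf': 10.0,  # raise to fight overfitting
}
```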

For more information, see: www.huaxiaozhuan.com/%E5%B7%A5%E…

An example using LightGBM's sklearn interface

import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
data = iris.data
target = iris.target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Train a regressor with the sklearn interface
gbm = lgb.LGBMRegressor(objective='regression', num_leaves=31, learning_rate=0.05, n_estimators=20)
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='l1', early_stopping_rounds=5)

# Predict using the best iteration found by early stopping
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

# Model evaluation
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

# Feature importances
print('Feature importances:', list(gbm.feature_importances_))

# Grid search over hyperparameters
estimator = lgb.LGBMRegressor(num_leaves=31)
param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40]
}
gbm = GridSearchCV(estimator, param_grid)
gbm.fit(X_train, y_train)
print('Best parameters found by grid search are:', gbm.best_params_)

An example using LightGBM's native interface

import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
data = iris.data
target = iris.target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Build the Dataset objects used by the native API
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Training parameters
params = {
    'task': 'train',
    'boosting_type': 'gbdt',        # type of boosting
    'objective': 'regression',      # objective function
    'metric': {'l2', 'auc'},        # evaluation metrics
    'num_leaves': 31,               # number of leaves
    'learning_rate': 0.05,          # learning rate
    'feature_fraction': 0.9,        # fraction of features used to build each tree
    'bagging_fraction': 0.8,        # fraction of samples used to build each tree
    'bagging_freq': 5,              # perform bagging every k iterations
    'verbose': 1                    # <0: fatal, =0: error/warning, >0: info
}

# Train with early stopping on the validation set
gbm = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=lgb_eval, early_stopping_rounds=5)

# Save the model to a file
gbm.save_model('model.txt')

# Predict using the best iteration
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

# Model evaluation
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

Parameter quick reference

| xgb | lgb | xgb.sklearn | lgb.sklearn |
| --- | --- | --- | --- |
| booster='gbtree' | boosting='gbdt' | booster='gbtree' | boosting_type='gbdt' |
| objective='binary:logistic' | application='binary' | objective='binary:logistic' | objective='binary' |
| max_depth=7 | num_leaves=2**7 | max_depth=7 | num_leaves=2**7 |
| eta=0.1 | learning_rate=0.1 | learning_rate=0.1 | learning_rate=0.1 |
| num_boost_round=10 | num_boost_round=10 | n_estimators=10 | n_estimators=10 |
| gamma=0 | min_split_gain=0.0 | gamma=0 | min_split_gain=0.0 |
| min_child_weight=5 | min_child_weight=5 | min_child_weight=5 | min_child_weight=5 |
| subsample=1 | bagging_fraction=1 | subsample=1.0 | subsample=1.0 |
| colsample_bytree=1.0 | feature_fraction=1 | colsample_bytree=1.0 | colsample_bytree=1.0 |
| alpha=0 | lambda_l1=0 | reg_alpha=0.0 | reg_alpha=0.0 |
| lambda=1 | lambda_l2=0 | reg_lambda=1 | reg_lambda=0.0 |
| scale_pos_weight=1 | scale_pos_weight=1 | scale_pos_weight=1 | scale_pos_weight=1 |
| seed | bagging_seed, feature_fraction_seed | random_state=888 | random_state=888 |
| nthread | num_threads | n_jobs=4 | n_jobs=4 |
| evals | valid_sets | eval_set | eval_set |
| eval_metric | metric | eval_metric | eval_metric |
| early_stopping_rounds | early_stopping_rounds | early_stopping_rounds | early_stopping_rounds |
| verbose_eval | verbose_eval | verbose | verbose |

Reference link: bacterous.github.io/2018/09/13/…

More references

  • Project address: github.com/Microsoft/L…
  • English: lightgbm.apachecn.org/
  • Refer to the link: www.zhihu.com/question/26…