Gradient Boosting
▶︎ reference
・How to explain gradient boosting
・Gradient Boosting (Wikipedia)
▶︎ Parameter tuning in Gradient Boosting
・Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM)
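The referenced articles explain the core mechanism: each new tree is fitted to the residuals (the negative gradient of the squared loss) of the current model, and its prediction is added with a shrinkage factor. A minimal from-scratch sketch with decision stumps (all function names here are illustrative, not from any library):

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression tree (stump) to the residuals by
    scanning all thresholds and minimizing squared error."""
    best_err, best = np.inf, None
    for thr in np.unique(x):
        left, right = residual[x <= thr], residual[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        lm, rm = left.mean(), right.mean()
        err = ((left - lm) ** 2).sum() + ((right - rm) ** 2).sum()
        if err < best_err:
            best_err, best = err, (thr, lm, rm)
    return best

def predict_stump(stump, x):
    thr, lm, rm = stump
    return np.where(x <= thr, lm, rm)

def gradient_boost(x, y, n_trees=50, lr=0.1):
    """Gradient boosting for squared loss: start from the mean, then
    repeatedly fit a stump to the residuals and add a shrunken update."""
    pred = np.full_like(y, y.mean(), dtype=float)
    model = [y.mean()]
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)        # negative gradient = residual
        pred += lr * predict_stump(stump, x)  # shrinkage (learning rate)
        model.append(stump)
    return model, pred

# toy data: a step function of x
x = np.linspace(0, 1, 20)
y = (x > 0.5).astype(float)
model, pred = gradient_boost(x, y)
```

With each iteration the residual shrinks by a factor of (1 - lr), which is why a small learning rate needs many trees.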
XGBoost
XGBoost can do:
・Automatic missing value treatment
・Automatic Sparse Data Optimization
・Extensibility: customized objective functions
・Feature importance estimation
・Faster Training Speed via Parallel Training
・Out of Core Computation
XGBoost can not do:
・Feature engineering
・Hyperparameter tuning
▶︎ reference
・A Beginner’s guide to XGBoost
・A Gentle Introduction to XGBoost for Applied Machine Learning
▶︎ Parameter tuning in XGBoost
・Complete Guide to Parameter Tuning in XGBoost with codes in Python
LightGBM
▶ reference
▶ Caveats:
- Prone to overfitting, so more data is better. (The referenced blog advises using it only on datasets with 10,000+ rows.)
- Data must be converted to the LightGBM Dataset format before training.
e.g.)
import lightgbm as lgb
# Convert to LightGBM Dataset objects; the eval set is used for validation during training
lgb_train = lgb.Dataset(x_train, label=y_train)
lgb_eval = lgb.Dataset(x_test, label=y_test, reference=lgb_train)
# Train (with little data, skip valid_sets and train without validation)
clf = lgb.train(params, lgb_train, 100, valid_sets=(lgb_train, lgb_eval))
▶ Basic parameters (from the reference sites)
(There are 100+ parameters; the ones below are the basics.)
For details, consult the official documentation!
Control Parameters
- max_depth: The maximum depth of a tree, used to limit model overfitting. Whenever the model looks overfitted, lowering max_depth is the first thing to try.
- min_data_in_leaf: The minimum number of records a leaf may have. The default is 20; it is also used to combat overfitting.
- feature_fraction: LightGBM will randomly select a subset of features on each iteration (tree) if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features before training each tree.
- bagging_fraction: specifies the fraction of data to be used for each iteration and is generally used to speed up the training and avoid overfitting.
- early_stopping_round: This parameter can speed up your analysis: training stops if a validation metric has not improved in the last early_stopping_round rounds, avoiding excessive iterations.
- lambda: Specifies the regularization strength (lambda_l1 for L1, lambda_l2 for L2). Typical values range from 0 to 1.
- min_gain_to_split: The minimum gain required to make a split; it can be used to control the number of useful splits in a tree.
- max_cat_group: When a categorical feature has many categories, finding split points on it easily overfits, so LightGBM merges categories into at most max_cat_group groups and finds split points on the group boundaries. Default: 64.
Core Parameters
- task: Specifies the task to perform on the data, either train or predict.
- application: The most important parameter; it specifies the objective of your model, i.e., whether it is a regression or classification problem. LightGBM treats the model as regression by default.
- regression: for regression
- binary: for binary classification
- multiclass: for multiclass classification problem
- boosting: defines the type of algorithm you want to run, default=gbdt
- gbdt: traditional Gradient Boosting Decision Tree
- rf: random forest
- dart: Dropouts meet Multiple Additive Regression Trees
- goss: Gradient-based One-Side Sampling
- num_boost_round: Number of boosting iterations, typically 100+ (i.e., the number of trees)
- learning_rate: Determines the impact of each tree on the final outcome. GBM starts from an initial estimate that is updated using the output of each tree; the learning rate controls the magnitude of these updates. Typical values: 0.1, 0.001, 0.003…
- num_leaves: number of leaves in full tree, default: 31
- device: default: cpu, can also pass gpu
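The learning_rate above is the shrinkage factor $\eta$ in the standard boosting update, where $h_m$ is the tree fitted at iteration $m$:

```latex
F_m(x) = F_{m-1}(x) + \eta \, h_m(x), \qquad 0 < \eta \le 1
```

A small $\eta$ means each tree contributes less, so num_boost_round must grow to compensate; this is the trade-off behind the "small learning_rate with large num_iterations" tip below.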
Metric parameter
- metric: again one of the important parameters, as it specifies the loss used to evaluate the model. Below are a few common losses for regression and classification.
- mae: mean absolute error
- mse: mean squared error
- binary_logloss: loss for binary classification
- multi_logloss: loss for multi classification
IO parameter
- max_bin: Denotes the maximum number of bins that feature values will be bucketed into.
- categorical_feature: Denotes the indices of categorical features. If categorical_feature=0,1,2, then columns 0, 1, and 2 are treated as categorical variables.
- ignore_column: Same as categorical_feature, but instead of treating the listed columns as categorical, it ignores them completely.
- save_binary: If the memory footprint of your data file is a concern, set this parameter to True. The dataset will then be saved to a binary file, which speeds up data reading the next time.
▶ Tuning tips
For Faster Speed:
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use small max_bin
- Use save_binary to speed up data loading in future learning
- Use parallel learning
For better accuracy:
- Use large max_bin (may be slower)
- Use small learning_rate with large num_iterations
- Use large num_leaves(may cause over-fitting)
- Use bigger training data
- Try dart
- Try to use categorical feature directly
To deal with over-fitting:
- Use small max_bin
- Use small num_leaves
- Use min_data_in_leaf and min_sum_hessian_in_leaf
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use bigger training data
- Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
- Try max_depth to avoid growing deep trees
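The anti-overfitting checklist above maps onto a parameter dict like the following (the specific values are illustrative starting points, not recommendations):

```python
# Illustrative LightGBM parameters combining the anti-overfitting levers above.
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 15,               # small num_leaves
    "max_depth": 5,                 # cap tree depth
    "min_data_in_leaf": 50,         # require more records per leaf
    "min_sum_hessian_in_leaf": 1e-2,
    "feature_fraction": 0.8,        # feature sub-sampling
    "bagging_fraction": 0.8,        # row sub-sampling ...
    "bagging_freq": 5,              # ... performed every 5 iterations
    "lambda_l1": 0.1,               # L1 regularization
    "lambda_l2": 0.1,               # L2 regularization
    "min_gain_to_split": 0.01,      # minimum gain to make a split
}
```

This dict would be passed as the first argument to lgb.train, as in the earlier snippet.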