Gradient Boosting
▶︎ reference
・How to explain gradient boosting
・Gradient Boosting (Wikipedia)
▶︎ Parameter tuning in Gradient Boosting
・Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM)
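The referenced articles explain the core mechanism: each new tree is fitted to the residuals (the negative gradient of the squared loss) of the current model, and its prediction is added with a shrinkage factor. A minimal from-scratch sketch with decision stumps (all function names here are illustrative, not from any library):

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression tree (stump) to the residuals by
    scanning all thresholds and minimizing squared error."""
    best_err, best = np.inf, None
    for thr in np.unique(x):
        left, right = residual[x <= thr], residual[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        lm, rm = left.mean(), right.mean()
        err = ((left - lm) ** 2).sum() + ((right - rm) ** 2).sum()
        if err < best_err:
            best_err, best = err, (thr, lm, rm)
    return best

def predict_stump(stump, x):
    thr, lm, rm = stump
    return np.where(x <= thr, lm, rm)

def gradient_boost(x, y, n_trees=50, lr=0.1):
    """Gradient boosting for squared loss: start from the mean, then
    repeatedly fit a stump to the residuals and add a shrunken update."""
    pred = np.full_like(y, y.mean(), dtype=float)
    model = [y.mean()]
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)        # negative gradient = residual
        pred += lr * predict_stump(stump, x)  # shrinkage (learning rate)
        model.append(stump)
    return model, pred

# toy data: a step function of x
x = np.linspace(0, 1, 20)
y = (x > 0.5).astype(float)
model, pred = gradient_boost(x, y)
```

With each iteration the residual shrinks by a factor of (1 - lr), which is why a small learning rate needs many trees.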
XGBoost
XGBoost can do:
・Automatic missing value treatment
・Automatic Sparse Data Optimization
・Extensibility: customized objective functions
・Feature importance estimation
・Faster Training Speed via Parallel Training
・Out of Core Computation
XGBoost can not do:
・Feature engineering
・Hyperparameter tuning
▶︎ reference
・A Beginner’s guide to XGBoost
・A Gentle Introduction to XGBoost for Applied Machine Learning
▶︎ Parameter tuning in XGBoost
・Complete Guide to Parameter Tuning in XGBoost with codes in Python
LightGBM
▶ reference
▶ Caveats:
- Prone to overfitting, so more data is better. (The referenced blog advises using it only on datasets with 10,000+ rows.)
- Data must be converted to the LightGBM Dataset format before training.
e.g.)
import lightgbm as lgb
# Convert to LightGBM Dataset objects; the eval set is used for validation during training
lgb_train = lgb.Dataset(x_train, label=y_train)
lgb_eval = lgb.Dataset(x_test, label=y_test, reference=lgb_train)
# Train (with little data, skip valid_sets and train without validation)
clf = lgb.train(params, lgb_train, 100, valid_sets=(lgb_train, lgb_eval))
▶ Basic parameters (from the reference sites)
(There are 100+ parameters; the ones below are the basics.)
For details, consult the official documentation!
Control Parameters
- max_depth: The maximum depth of a tree, used to limit model overfitting. Whenever the model looks overfitted, lowering max_depth is the first thing to try.
- min_data_in_leaf: The minimum number of records a leaf may have. The default is 20; it is also used to combat overfitting.
- feature_fraction: LightGBM will randomly select a subset of features on each iteration (tree) if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features before training each tree.
- bagging_fraction: specifies the fraction of data to be used for each iteration and is generally used to speed up the training and avoid overfitting.
- early_stopping_round: This parameter can speed up your analysis: training stops if a validation metric has not improved in the last early_stopping_round rounds, avoiding excessive iterations.
- lambda: Specifies the regularization strength (lambda_l1 for L1, lambda_l2 for L2). Typical values range from 0 to 1.
- min_gain_to_split: The minimum gain required to make a split; it can be used to control the number of useful splits in a tree.
- max_cat_group: When a categorical feature has many categories, finding split points on it easily overfits, so LightGBM merges categories into at most max_cat_group groups and finds split points on the group boundaries. Default: 64.
Core Parameters
- task: Specifies the task to perform on the data, either train or predict.
- application: The most important parameter; it specifies the objective of your model, i.e., whether it is a regression or classification problem. LightGBM treats the model as regression by default.
- regression: for regression
- binary: for binary classification
- multiclass: for multiclass classification problem
- boosting: defines the type of algorithm you want to run, default=gbdt
- gbdt: traditional Gradient Boosting Decision Tree
- rf: random forest
- dart: Dropouts meet Multiple Additive Regression Trees
- goss: Gradient-based One-Side Sampling
- num_boost_round: Number of boosting iterations, typically 100+ (i.e., the number of trees)
- learning_rate: Determines the impact of each tree on the final outcome. GBM starts from an initial estimate that is updated using the output of each tree; the learning rate controls the magnitude of these updates. Typical values: 0.1, 0.001, 0.003…
- num_leaves: number of leaves in full tree, default: 31
- device: default: cpu, can also pass gpu
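The learning_rate above is the shrinkage factor $\eta$ in the standard boosting update, where $h_m$ is the tree fitted at iteration $m$:

```latex
F_m(x) = F_{m-1}(x) + \eta \, h_m(x), \qquad 0 < \eta \le 1
```

A small $\eta$ means each tree contributes less, so num_boost_round must grow to compensate; this is the trade-off behind the "small learning_rate with large num_iterations" tip below.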
Metric parameter
- metric: again one of the important parameters, as it specifies the loss used to evaluate the model. Below are a few common losses for regression and classification.
- mae: mean absolute error
- mse: mean squared error
- binary_logloss: loss for binary classification
- multi_logloss: loss for multi classification
IO parameter
- max_bin: Denotes the maximum number of bins that feature values will be bucketed into.
- categorical_feature: Denotes the indices of categorical features. If categorical_feature=0,1,2, then columns 0, 1, and 2 are treated as categorical variables.
- ignore_column: Same as categorical_feature, but instead of treating the listed columns as categorical, it ignores them completely.
- save_binary: If the memory footprint of your data file is a concern, set this parameter to True. The dataset will then be saved to a binary file, which speeds up data reading the next time.
▶ Tuning tips
For Faster Speed:
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use small max_bin
- Use save_binary to speed up data loading in future learning
- Use parallel learning
For better accuracy:
- Use large max_bin (may be slower)
- Use small learning_rate with large num_iterations
- Use large num_leaves(may cause over-fitting)
- Use bigger training data
- Try dart
- Try to use categorical feature directly
To deal with over-fitting:
- Use small max_bin
- Use small num_leaves
- Use min_data_in_leaf and min_sum_hessian_in_leaf
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use bigger training data
- Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
- Try max_depth to avoid growing deep trees
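The anti-overfitting checklist above maps onto a parameter dict like the following (the specific values are illustrative starting points, not recommendations):

```python
# Illustrative LightGBM parameters combining the anti-overfitting levers above.
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 15,               # small num_leaves
    "max_depth": 5,                 # cap tree depth
    "min_data_in_leaf": 50,         # require more records per leaf
    "min_sum_hessian_in_leaf": 1e-2,
    "feature_fraction": 0.8,        # feature sub-sampling
    "bagging_fraction": 0.8,        # row sub-sampling ...
    "bagging_freq": 5,              # ... performed every 5 iterations
    "lambda_l1": 0.1,               # L1 regularization
    "lambda_l2": 0.1,               # L2 regularization
    "min_gain_to_split": 0.01,      # minimum gain to make a split
}
```

This dict would be passed as the first argument to lgb.train, as in the earlier snippet.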