Predicting Sanfrecce Hiroshima's Attendance This Season with Machine Learning, Part 2 (Including XGBoost Tuning)
The J.League season opens this weekend, so I used this week's published weather forecast and the XGBoost algorithm to predict this season's attendance again. I picked up XGBoost while working through Kaggle's House Prices competition, so this post also covers how to tune it.
This Weekend's Weather
The forecast for this weekend is sunny with a temperature of 11 °C.
What Is XGBoost?
It is an algorithm with a strong track record in data-analysis competitions like Kaggle. The following articles are a good reference for how it actually works. I can follow the objective function now, but before I took the Machine Learning course I probably wouldn't have known what it even was.
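As a quick reminder (this is the standard formulation from the XGBoost paper, not anything specific to this post), the objective XGBoost minimizes is a loss term plus a per-tree complexity penalty:

```latex
\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2
```

Here $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, and $w$ its leaf weights. The $\gamma$ penalty is exactly the `gamma` parameter tuned later in this post, and the weight penalties correspond to the `reg_lambda`/`reg_alpha` parameters.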
パッケージユーザーのための機械学習(12):Xgboost (eXtreme Gradient Boosting) - 六本木で働くデータサイエンティストのブログ
XGBoostやパラメータチューニングの仕方に関する調査 | かものはしの分析ブログ
Tuning XGBoost
The tuning procedure is based on this notebook:
Analytics_Vidhya/XGBoost models.ipynb at master · aarshayj/Analytics_Vidhya · GitHub
The actual tuning results follow. First, load the libraries and the data tables. The tables are the ones built in this earlier post:
Predicting Sanfrecce Hiroshima's Attendance This Season with Machine Learning - Masser's Blog
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
%matplotlib inline

train = pd.read_csv('sanftrain.csv')
test = pd.read_csv('sanftest.csv')
train_ID = train['Key']
test_ID = test['Key']
y_train = train['Audience']
X_train = train.drop(['Key', 'Audience'], axis=1)
X_test = test.drop('Key', axis=1)
First, tune max_depth and n_estimators.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBRegressor(max_depth=5, min_child_weight=1, gamma=0,
                             subsample=0.2, colsample_bytree=0.2, learning_rate=0.1)
reg = GridSearchCV(xgb_model,
                   {'max_depth': [2, 3, 4, 5, 6],
                    'n_estimators': [50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300]},
                   verbose=1)
reg.fit(X_train, y_train)
The results are shown below: max_depth = 4 with n_estimators = 300 gives the best score.
reg.grid_scores_, reg.best_params_, reg.best_score_

([mean: 0.30059, std: 0.01853, params: {'max_depth': 2, 'n_estimators': 50},
  mean: 0.39571, std: 0.00777, params: {'max_depth': 2, 'n_estimators': 75},
  mean: 0.44242, std: 0.01296, params: {'max_depth': 2, 'n_estimators': 100},
  mean: 0.47352, std: 0.02664, params: {'max_depth': 2, 'n_estimators': 125},
  mean: 0.49547, std: 0.01577, params: {'max_depth': 2, 'n_estimators': 150},
  mean: 0.54260, std: 0.02425, params: {'max_depth': 2, 'n_estimators': 175},
  mean: 0.54957, std: 0.03926, params: {'max_depth': 2, 'n_estimators': 200},
  mean: 0.52969, std: 0.04004, params: {'max_depth': 2, 'n_estimators': 225},
  mean: 0.54029, std: 0.05254, params: {'max_depth': 2, 'n_estimators': 250},
  mean: 0.54485, std: 0.04447, params: {'max_depth': 2, 'n_estimators': 275},
  mean: 0.55701, std: 0.04541, params: {'max_depth': 2, 'n_estimators': 300},
  mean: 0.29411, std: 0.02525, params: {'max_depth': 3, 'n_estimators': 50},
  mean: 0.39210, std: 0.00699, params: {'max_depth': 3, 'n_estimators': 75},
  mean: 0.44110, std: 0.01439, params: {'max_depth': 3, 'n_estimators': 100},
  mean: 0.47381, std: 0.02580, params: {'max_depth': 3, 'n_estimators': 125},
  mean: 0.49691, std: 0.01668, params: {'max_depth': 3, 'n_estimators': 150},
  mean: 0.54583, std: 0.02631, params: {'max_depth': 3, 'n_estimators': 175},
  mean: 0.55439, std: 0.04097, params: {'max_depth': 3, 'n_estimators': 200},
  mean: 0.53532, std: 0.04101, params: {'max_depth': 3, 'n_estimators': 225},
  mean: 0.54449, std: 0.05164, params: {'max_depth': 3, 'n_estimators': 250},
  mean: 0.54941, std: 0.04269, params: {'max_depth': 3, 'n_estimators': 275},
  mean: 0.56011, std: 0.04418, params: {'max_depth': 3, 'n_estimators': 300},
  mean: 0.29457, std: 0.02471, params: {'max_depth': 4, 'n_estimators': 50},
  mean: 0.39260, std: 0.00701, params: {'max_depth': 4, 'n_estimators': 75},
  mean: 0.44148, std: 0.01402, params: {'max_depth': 4, 'n_estimators': 100},
  mean: 0.47413, std: 0.02567, params: {'max_depth': 4, 'n_estimators': 125},
  mean: 0.49719, std: 0.01653, params: {'max_depth': 4, 'n_estimators': 150},
  mean: 0.54507, std: 0.02510, params: {'max_depth': 4, 'n_estimators': 175},
  mean: 0.55360, std: 0.03993, params: {'max_depth': 4, 'n_estimators': 200},
  mean: 0.53499, std: 0.04068, params: {'max_depth': 4, 'n_estimators': 225},
  mean: 0.54440, std: 0.05068, params: {'max_depth': 4, 'n_estimators': 250},
  mean: 0.54975, std: 0.04196, params: {'max_depth': 4, 'n_estimators': 275},
  mean: 0.56034, std: 0.04323, params: {'max_depth': 4, 'n_estimators': 300},
  mean: 0.29457, std: 0.02471, params: {'max_depth': 5, 'n_estimators': 50},
  mean: 0.39260, std: 0.00701, params: {'max_depth': 5, 'n_estimators': 75},
  mean: 0.44148, std: 0.01402, params: {'max_depth': 5, 'n_estimators': 100},
  mean: 0.47413, std: 0.02567, params: {'max_depth': 5, 'n_estimators': 125},
  mean: 0.49719, std: 0.01653, params: {'max_depth': 5, 'n_estimators': 150},
  mean: 0.54507, std: 0.02510, params: {'max_depth': 5, 'n_estimators': 175},
  mean: 0.55360, std: 0.03993, params: {'max_depth': 5, 'n_estimators': 200},
  mean: 0.53499, std: 0.04068, params: {'max_depth': 5, 'n_estimators': 225},
  mean: 0.54440, std: 0.05068, params: {'max_depth': 5, 'n_estimators': 250},
  mean: 0.54975, std: 0.04196, params: {'max_depth': 5, 'n_estimators': 275},
  mean: 0.56034, std: 0.04323, params: {'max_depth': 5, 'n_estimators': 300},
  mean: 0.29457, std: 0.02471, params: {'max_depth': 6, 'n_estimators': 50},
  mean: 0.39260, std: 0.00701, params: {'max_depth': 6, 'n_estimators': 75},
  mean: 0.44148, std: 0.01402, params: {'max_depth': 6, 'n_estimators': 100},
  mean: 0.47413, std: 0.02567, params: {'max_depth': 6, 'n_estimators': 125},
  mean: 0.49719, std: 0.01653, params: {'max_depth': 6, 'n_estimators': 150},
  mean: 0.54507, std: 0.02510, params: {'max_depth': 6, 'n_estimators': 175},
  mean: 0.55360, std: 0.03993, params: {'max_depth': 6, 'n_estimators': 200},
  mean: 0.53499, std: 0.04068, params: {'max_depth': 6, 'n_estimators': 225},
  mean: 0.54440, std: 0.05068, params: {'max_depth': 6, 'n_estimators': 250},
  mean: 0.54975, std: 0.04196, params: {'max_depth': 6, 'n_estimators': 275},
  mean: 0.56034, std: 0.04323, params: {'max_depth': 6, 'n_estimators': 300}],
 {'max_depth': 4, 'n_estimators': 300},
 0.5603390632246239)
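A side note on reading these results: `grid_scores_` was removed in scikit-learn 0.20, so on current versions the same information lives in `cv_results_`. A minimal sketch, using a plain scikit-learn tree and synthetic data in place of XGBoost and the attendance table so it runs standalone:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the attendance table)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

reg = GridSearchCV(DecisionTreeRegressor(random_state=0),
                   {'max_depth': [2, 3, 4]}, cv=3)
reg.fit(X, y)

# cv_results_ replaces grid_scores_: mean/std test score per candidate
for mean, std, params in zip(reg.cv_results_['mean_test_score'],
                             reg.cv_results_['std_test_score'],
                             reg.cv_results_['params']):
    print(f"mean: {mean:.5f}, std: {std:.5f}, params: {params}")
print(reg.best_params_, reg.best_score_)
```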
Next, tune gamma.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBRegressor(max_depth=4, n_estimators=300, min_child_weight=1, gamma=0,
                             subsample=0.2, colsample_bytree=0.2, learning_rate=0.1)
reg = GridSearchCV(xgb_model, {'gamma': [i / 10.0 for i in range(0, 5)]}, verbose=1)
reg.fit(X_train, y_train)
The results are below. The score is identical for every value of gamma, so gamma stays at 0.0.
reg.grid_scores_, reg.best_params_, reg.best_score_

([mean: 0.56034, std: 0.04323, params: {'gamma': 0.0},
  mean: 0.56034, std: 0.04323, params: {'gamma': 0.1},
  mean: 0.56034, std: 0.04323, params: {'gamma': 0.2},
  mean: 0.56034, std: 0.04323, params: {'gamma': 0.3},
  mean: 0.56034, std: 0.04323, params: {'gamma': 0.4}],
 {'gamma': 0.0},
 0.5603390632246239)
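Since all five candidates tie, GridSearchCV effectively reports the first one: the best index comes from an argmax over the mean test scores, and on ties numpy's argmax returns the first occurrence. A toy illustration of that tie-breaking (pure-Python stand-in, not the actual search):

```python
import numpy as np

# Mean CV scores for gamma = 0.0, 0.1, 0.2, 0.3, 0.4 -- all identical,
# mirroring the grid-search output above.
gammas = [0.0, 0.1, 0.2, 0.3, 0.4]
mean_scores = np.array([0.56034, 0.56034, 0.56034, 0.56034, 0.56034])

# argmax returns the index of the FIRST maximum on ties
best_gamma = gammas[int(np.argmax(mean_scores))]
print(best_gamma)  # -> 0.0
```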
Next, tune subsample and colsample_bytree.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBRegressor(max_depth=2, n_estimators=100, min_child_weight=1, gamma=0,
                             subsample=0.2, colsample_bytree=0.2, learning_rate=0.1)
reg = GridSearchCV(xgb_model,
                   {'subsample': [i / 10.0 for i in range(4, 10)],
                    'colsample_bytree': [i / 10.0 for i in range(4, 10)]},
                   verbose=1)
reg.fit(X_train, y_train)
The results are below: subsample and colsample_bytree are both best at 0.7.
reg.grid_scores_, reg.best_params_, reg.best_score_

([mean: 0.55839, std: 0.05922, params: {'colsample_bytree': 0.4, 'subsample': 0.4},
  mean: 0.55703, std: 0.08538, params: {'colsample_bytree': 0.4, 'subsample': 0.5},
  mean: 0.55295, std: 0.09469, params: {'colsample_bytree': 0.4, 'subsample': 0.6},
  mean: 0.54332, std: 0.07073, params: {'colsample_bytree': 0.4, 'subsample': 0.7},
  mean: 0.54041, std: 0.06807, params: {'colsample_bytree': 0.4, 'subsample': 0.8},
  mean: 0.55073, std: 0.06947, params: {'colsample_bytree': 0.4, 'subsample': 0.9},
  mean: 0.56410, std: 0.06286, params: {'colsample_bytree': 0.5, 'subsample': 0.4},
  mean: 0.54650, std: 0.08238, params: {'colsample_bytree': 0.5, 'subsample': 0.5},
  mean: 0.55583, std: 0.09343, params: {'colsample_bytree': 0.5, 'subsample': 0.6},
  mean: 0.55248, std: 0.09086, params: {'colsample_bytree': 0.5, 'subsample': 0.7},
  mean: 0.54473, std: 0.08220, params: {'colsample_bytree': 0.5, 'subsample': 0.8},
  mean: 0.55531, std: 0.08295, params: {'colsample_bytree': 0.5, 'subsample': 0.9},
  mean: 0.56410, std: 0.06286, params: {'colsample_bytree': 0.6, 'subsample': 0.4},
  mean: 0.54650, std: 0.08238, params: {'colsample_bytree': 0.6, 'subsample': 0.5},
  mean: 0.55583, std: 0.09343, params: {'colsample_bytree': 0.6, 'subsample': 0.6},
  mean: 0.55248, std: 0.09086, params: {'colsample_bytree': 0.6, 'subsample': 0.7},
  mean: 0.54473, std: 0.08220, params: {'colsample_bytree': 0.6, 'subsample': 0.8},
  mean: 0.55531, std: 0.08295, params: {'colsample_bytree': 0.6, 'subsample': 0.9},
  mean: 0.55025, std: 0.07755, params: {'colsample_bytree': 0.7, 'subsample': 0.4},
  mean: 0.54751, std: 0.09015, params: {'colsample_bytree': 0.7, 'subsample': 0.5},
  mean: 0.55105, std: 0.09073, params: {'colsample_bytree': 0.7, 'subsample': 0.6},
  mean: 0.56454, std: 0.09483, params: {'colsample_bytree': 0.7, 'subsample': 0.7},
  mean: 0.54205, std: 0.10489, params: {'colsample_bytree': 0.7, 'subsample': 0.8},
  mean: 0.53911, std: 0.09839, params: {'colsample_bytree': 0.7, 'subsample': 0.9},
  mean: 0.55933, std: 0.07821, params: {'colsample_bytree': 0.8, 'subsample': 0.4},
  mean: 0.54618, std: 0.11937, params: {'colsample_bytree': 0.8, 'subsample': 0.5},
  mean: 0.55731, std: 0.11517, params: {'colsample_bytree': 0.8, 'subsample': 0.6},
  mean: 0.55182, std: 0.11408, params: {'colsample_bytree': 0.8, 'subsample': 0.7},
  mean: 0.55087, std: 0.09362, params: {'colsample_bytree': 0.8, 'subsample': 0.8},
  mean: 0.53608, std: 0.10538, params: {'colsample_bytree': 0.8, 'subsample': 0.9},
  mean: 0.55311, std: 0.09649, params: {'colsample_bytree': 0.9, 'subsample': 0.4},
  mean: 0.53675, std: 0.12155, params: {'colsample_bytree': 0.9, 'subsample': 0.5},
  mean: 0.56001, std: 0.10962, params: {'colsample_bytree': 0.9, 'subsample': 0.6},
  mean: 0.55277, std: 0.10686, params: {'colsample_bytree': 0.9, 'subsample': 0.7},
  mean: 0.55744, std: 0.09592, params: {'colsample_bytree': 0.9, 'subsample': 0.8},
  mean: 0.52927, std: 0.09682, params: {'colsample_bytree': 0.9, 'subsample': 0.9}],
 {'colsample_bytree': 0.7, 'subsample': 0.7},
 0.56453621270808296)
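With two parameters the flat list above is hard to scan; a pivot table makes the grid much easier to read. A sketch using a few of the mean scores copied from the output above (pandas assumed available):

```python
import pandas as pd

# A few (colsample_bytree, subsample, mean score) triples copied from
# the grid-search output above -- not the full 6x6 grid.
rows = [
    (0.4, 0.4, 0.55839), (0.4, 0.7, 0.54332),
    (0.7, 0.4, 0.55025), (0.7, 0.7, 0.56454),
]
df = pd.DataFrame(rows, columns=['colsample_bytree', 'subsample', 'mean_score'])

# Rows = subsample, columns = colsample_bytree, cells = mean CV score
table = df.pivot_table(index='subsample', columns='colsample_bytree',
                       values='mean_score')
print(table)

best = df.loc[df['mean_score'].idxmax()]
print(best['colsample_bytree'], best['subsample'])  # -> 0.7 0.7
```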
Finally, tune reg_alpha.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBRegressor(max_depth=2, n_estimators=100, min_child_weight=1, gamma=0,
                             subsample=0.7, colsample_bytree=0.7, learning_rate=0.1)
reg = GridSearchCV(xgb_model, {'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]}, verbose=1)
reg.fit(X_train, y_train)
The best reg_alpha turned out to be 1.
reg.grid_scores_, reg.best_params_, reg.best_score_

([mean: 0.56454, std: 0.09483, params: {'reg_alpha': 1e-05},
  mean: 0.56454, std: 0.09483, params: {'reg_alpha': 0.01},
  mean: 0.56454, std: 0.09483, params: {'reg_alpha': 0.1},
  mean: 0.56522, std: 0.09479, params: {'reg_alpha': 1},
  mean: 0.56443, std: 0.10188, params: {'reg_alpha': 100}],
 {'reg_alpha': 1},
 0.56521892471311341)
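Incidentally, the scores reported throughout are the regressor's default `score`, the coefficient of determination R² = 1 − SS_res / SS_tot (GridSearchCV falls back to the estimator's own scorer when no `scoring` is given). A numpy sketch with hypothetical attendance figures, not real data:

```python
import numpy as np

# Hypothetical attendance figures (made up for illustration) showing
# how an R^2 like the 0.565 above is computed.
y_true = np.array([14000., 18000., 22000., 16000., 20000.])
y_pred = np.array([15000., 17500., 21000., 17000., 19000.])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 4))  # -> 0.8938
```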
Using these tuned parameters, make the prediction.
reg = xgb.XGBRegressor(max_depth=4, n_estimators=300, min_child_weight=1, gamma=0,
                       subsample=0.7, colsample_bytree=0.7, learning_rate=0.1, reg_alpha=1)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)  # predicted attendance for this weekend's match
The prediction came out to 17,645 people. A bit on the high side, maybe? I'm looking forward to seeing how many actually turn up.